linux-pci.vger.kernel.org archive mirror
* [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA
@ 2021-03-11 23:31 Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn() Logan Gunthorpe
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Hi,

This is a rework of the first half of my RFC for doing P2PDMA in userspace
with O_DIRECT[1].

The largest issue with that series was the gross way of flagging P2PDMA
SGL segments. This RFC proposes a different approach (suggested by
Dan Williams[2]), which uses the third bit in the page_link field of the
SGL.

This approach is a lot less hacky but comes at the cost of adding a
CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
scarce bit in the page_link. For our purposes, a 64BIT restriction is
acceptable, but it's not clear whether the same is true for all use cases
hoping to make use of P2PDMA.

Matthew Wilcox has already suggested (off-list) that this is the wrong
approach, preferring a new dma mapping operation and an SGL replacement. I
don't disagree that something along those lines would be a better long
term solution, but it involves overcoming a lot of challenges to get
there. Creating a new mapping operation still means adding support to more
than 25 dma_map_ops implementations (many of which are on obscure
architectures) or creating a redundant fallback path through dma_map_sg()
for every driver that uses the new operation. This RFC is an approach
that doesn't require overcoming those obstacles.

Alternative ideas and feedback are welcome.

These patches are based on v5.12-rc2 and a git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_dma_map_ops_rfc

A branch with the patches from the previous RFC that add userspace
O_DIRECT support is available at the same URL with the name
"p2pdma_dma_map_ops_rfc+user" (however, none of the issues with those
extra patches from the feedback of the last posting have been fixed).

Thanks,

Logan

[1] https://lore.kernel.org/linux-block/20201106170036.18713-1-logang@deltatee.com/
[2] https://lore.kernel.org/linux-block/CAPcyv4ifGcrdOtUt8qr7pmFhmecGHqGVre9G0RorGczCGVECQQ@mail.gmail.com/

--

Logan Gunthorpe (11):
  PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
  PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
    pci_p2pdma_bus_offset()
  lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  block: Add BLK_STS_P2PDMA
  nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: Convert to using dma_map_sg for p2pdma pages

 block/blk-core.c            |  2 +
 drivers/iommu/dma-iommu.c   | 63 +++++++++++++++++++++-----
 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/pci.c     | 38 +++++++---------
 drivers/pci/Kconfig         |  2 +-
 drivers/pci/p2pdma.c        | 89 +++++++++++++++++++++++++++++++------
 include/linux/blk_types.h   |  7 +++
 include/linux/dma-map-ops.h |  3 ++
 include/linux/dma-mapping.h |  5 +++
 include/linux/pci-p2pdma.h  | 11 +++++
 include/linux/scatterlist.h | 49 ++++++++++++++++++--
 kernel/dma/direct.c         | 35 +++++++++++++--
 kernel/dma/mapping.c        | 21 +++++++--
 14 files changed, 271 insertions(+), 59 deletions(-)


base-commit: a38fd8748464831584a19438cbb3082b5a2dab15
--
2.20.1

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-12 20:39   ` Bjorn Helgaas
  2021-03-11 23:31 ` [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps Logan Gunthorpe
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

In order to call this function from a dma_map function, it must not sleep.
The only reason it sleeps is to allocate the seqbuf used to print
which devices are within the ACS path.

Switch the kmalloc call to use a passed-in gfp_mask, and skip printing
that message if the buffer cannot be allocated.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..bd89437faf06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
 
 static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
 {
-	if (!buf)
+	if (!buf || !buf->buffer)
 		return;
 
 	seq_buf_printf(buf, "%s;", pci_name(pdev));
@@ -495,25 +495,26 @@ upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
 
 static enum pci_p2pdma_map_type
 upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev *client,
-			      int *dist)
+			      int *dist, gfp_t gfp_mask)
 {
 	struct seq_buf acs_list;
 	bool acs_redirects;
 	int ret;
 
-	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
-	if (!acs_list.buffer)
-		return -ENOMEM;
+	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp_mask), PAGE_SIZE);
 
 	ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
 				       &acs_list);
 	if (acs_redirects) {
 		pci_warn(client, "ACS redirect is set between the client and provider (%s)\n",
 			 pci_name(provider));
-		/* Drop final semicolon */
-		acs_list.buffer[acs_list.len-1] = 0;
-		pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
-			 acs_list.buffer);
+
+		if (acs_list.buffer) {
+			/* Drop final semicolon */
+			acs_list.buffer[acs_list.len - 1] = 0;
+			pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
+				 acs_list.buffer);
+		}
 	}
 
 	if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
@@ -566,7 +567,7 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
 
 		if (verbose)
 			ret = upstream_bridge_distance_warn(provider,
-					pci_client, &distance);
+					pci_client, &distance, GFP_KERNEL);
 		else
 			ret = upstream_bridge_distance(provider, pci_client,
 						       &distance, NULL, NULL);
-- 
2.20.1



* [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn() Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-12 20:57   ` Bjorn Helgaas
  2021-03-11 23:31 ` [RFC PATCH v2 03/11] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

In order to use upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
might sleep.

To avoid this, try to get the host bridge's device from
bus->self, and if that is not set, just take the first element in the
bus's device list. It should be impossible for the host bridge's device
to go away while references are held on its child devices, so the first
element should not change and this should be safe.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bd89437faf06..2135fe69bb07 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -311,11 +311,15 @@ static const struct pci_p2pdma_whitelist_entry {
 static bool __host_bridge_whitelist(struct pci_host_bridge *host,
 				    bool same_host_bridge)
 {
-	struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
 	const struct pci_p2pdma_whitelist_entry *entry;
+	struct pci_dev *root = host->bus->self;
 	unsigned short vendor, device;
 
 	if (!root)
+		root = list_first_entry_or_null(&host->bus->devices,
+						struct pci_dev, bus_list);
+
+	if (!root || root->devfn)
 		return false;
 
 	vendor = root->vendor;
-- 
2.20.1



* [RFC PATCH v2 03/11] PCI/P2PDMA: Attempt to set map_type if it has not been set
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn() Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2135fe69bb07..7f6836732bce 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -819,11 +819,18 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
 						    struct pci_dev *client)
 {
+	enum pci_p2pdma_map_type ret;
+
 	if (!provider->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
-	return xa_to_value(xa_load(&provider->p2pdma->map_types,
-				   map_types_idx(client)));
+	ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+				  map_types_idx(client)));
+	if (ret != PCI_P2PDMA_MAP_UNKNOWN)
+		return ret;
+
+	return upstream_bridge_distance_warn(provider, client, NULL,
+					     GFP_ATOMIC);
 }
 
 static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -871,7 +878,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
 	default:
-		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
-- 
2.20.1



* [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 03/11] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-13  1:38   ` Ira Weiny
                     ` (2 more replies)
  2021-03-11 23:31 ` [RFC PATCH v2 05/11] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Introduce pci_p2pdma_dma_map_type() which is meant to be called by
DMA map functions to determine how to map a given p2pdma page.

pci_p2pdma_bus_offset() is also added to allow callers to get the bus
offset if they need to map the bus address.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 50 ++++++++++++++++++++++++++++++++++++++
 include/linux/pci-p2pdma.h | 11 +++++++++
 2 files changed, 61 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 7f6836732bce..66d16b7eb668 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -912,6 +912,56 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_bus_offset - returns the bus offset for a given page
+ * @page: page to get the offset for
+ *
+ * Must be passed a PCI p2pdma page.
+ */
+u64 pci_p2pdma_bus_offset(struct page *page)
+{
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
+
+	WARN_ON(!is_pci_p2pdma_page(page));
+
+	return p2p_pgmap->bus_offset;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
+
+/**
+ * pci_p2pdma_dma_map_type - determine if a DMA mapping should use the
+ *	bus address, be mapped normally or fail
+ * @dev: device doing the DMA request
+ * @pgmap: dev_pagemap structure for the mapping
+ *
+ * Returns:
+ *    1 - if the page should be mapped with a bus address,
+ *    0 - if the page should be mapped normally through an IOMMU mapping or
+ *        physical address; or
+ *   -1 - if the device should not map the pages and an error should be
+ *        returned
+ */
+int pci_p2pdma_dma_map_type(struct device *dev, struct dev_pagemap *pgmap)
+{
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
+	struct pci_dev *client;
+
+	if (!dev_is_pci(dev))
+		return -1;
+
+	client = to_pci_dev(dev);
+
+	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		return 0;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		return 1;
+	default:
+		return -1;
+	}
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_dma_map_type);
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  *		to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..3d55ad19f54c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -34,6 +34,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
+u64 pci_p2pdma_bus_offset(struct page *page);
+int pci_p2pdma_dma_map_type(struct device *dev, struct dev_pagemap *pgmap);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,6 +85,15 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline u64 pci_p2pdma_bus_offset(struct page *page)
+{
+	return -1;
+}
+static inline int pci_p2pdma_dma_map_type(struct device *dev,
+					  struct dev_pagemap *pgmap)
+{
+	return -1;
+}
 static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
 		struct scatterlist *sg, int nents, enum dma_data_direction dir,
 		unsigned long attrs)
-- 
2.20.1



* [RFC PATCH v2 05/11] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Make use of the third free LSB in scatterlist's page_link on 64-bit systems.

The extra bit will be used by dma_[un]map_sg() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg() will need to perform different cleanup when this is the
case.

Using this bit requires adding an additional dependency on CONFIG_64BIT to
CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes, which typically need the
extra address space for the memory BARs used in the transactions anyway.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig         |  2 +-
 include/linux/scatterlist.h | 49 ++++++++++++++++++++++++++++++++++---
 2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..90b4bddb3300 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -163,7 +163,7 @@ config PCI_PASID
 
 config PCI_P2PDMA
 	bool "PCI peer-to-peer transfer support"
-	depends on ZONE_DEVICE
+	depends on ZONE_DEVICE && 64BIT
 	select GENERIC_ALLOCATOR
 	help
	  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f70572b2938..5525d3ebf36f 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -58,6 +58,21 @@ struct sg_table {
 #define SG_CHAIN	0x01UL
 #define SG_END		0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a PCI
+ * bus address when doing P2PDMA.
+ * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_PCI_P2PDMA	0x04UL
+#else
+#define SG_PCI_P2PDMA	0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_PCI_P2PDMA)
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -65,8 +80,9 @@ struct sg_table {
  */
 #define sg_is_chain(sg)		((sg)->page_link & SG_CHAIN)
 #define sg_is_last(sg)		((sg)->page_link & SG_END)
+#define sg_is_pci_p2pdma(sg)	((sg)->page_link & SG_PCI_P2PDMA)
 #define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+	((struct scatterlist *) ((sg)->page_link & ~SG_PAGE_LINK_MASK))
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -80,13 +96,13 @@ struct sg_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+	unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
 
 	/*
 	 * In order for the low bit stealing approach to work, pages
 	 * must be aligned at a 32-bit boundary as a minimum.
 	 */
-	BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+	BUG_ON((unsigned long) page & SG_PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
@@ -120,7 +136,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+	return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
 }
 
 /**
@@ -222,6 +238,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 	sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg:		 SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a PCI bus address.
+ **/
+static inline void sg_mark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link |= SG_PCI_P2PDMA;
+}
+
+/**
+ * sg_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg:		 SG entry
+ *
+ * Description:
+ *   Clears the PCI P2PDMA mark
+ **/
+static inline void sg_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link &= ~SG_PCI_P2PDMA;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg:	     SG entry
-- 
2.20.1



* [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 05/11] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-12 15:52   ` Robin Murphy
  2021-03-16  8:11   ` Christoph Hellwig
  2021-03-11 23:31 ` [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

SGL segments that contain PCI bus addresses are marked with
sg_mark_pci_p2pdma() and are ignored when unmapped.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 kernel/dma/direct.c  | 35 ++++++++++++++++++++++++++++++++---
 kernel/dma/mapping.c | 13 ++++++++++---
 2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 002268262c9a..f326d32062dd 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include "direct.h"
 
 /*
@@ -387,19 +388,47 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	for_each_sg(sgl, sg, nents, i)
+	for_each_sg(sgl, sg, nents, i) {
+		if (sg_is_pci_p2pdma(sg))
+			continue;
+
 		dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
 			     attrs);
+	}
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	int i;
+	struct dev_pagemap *pgmap = NULL;
+	int i, map = -1, ret = 0;
 	struct scatterlist *sg;
+	u64 bus_off;
 
 	for_each_sg(sgl, sg, nents, i) {
+		if (is_pci_p2pdma_page(sg_page(sg))) {
+			if (sg_page(sg)->pgmap != pgmap) {
+				pgmap = sg_page(sg)->pgmap;
+				map = pci_p2pdma_dma_map_type(dev, pgmap);
+				bus_off = pci_p2pdma_bus_offset(sg_page(sg));
+			}
+
+			if (map < 0) {
+				sg->dma_address = DMA_MAPPING_ERROR;
+				ret = -EREMOTEIO;
+				goto out_unmap;
+			}
+
+			if (map) {
+				sg->dma_address = sg_phys(sg) + sg->offset -
+					bus_off;
+				sg_dma_len(sg) = sg->length;
+				sg_mark_pci_p2pdma(sg);
+				continue;
+			}
+		}
+
 		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
 				sg->offset, sg->length, dir, attrs);
 		if (sg->dma_address == DMA_MAPPING_ERROR)
@@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 
 out_unmap:
 	dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-	return 0;
+	return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..adc1a83950be 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 EXPORT_SYMBOL(dma_unmap_page_attrs);
 
 /*
- * dma_maps_sg_attrs returns 0 on error and > 0 on success.
- * It should never return a value < 0.
+ * dma_map_sg_attrs returns 0 on any resource error and > 0 on success.
+ *
+ * If 0 is returned, the mapping can be retried and will succeed once
+ * sufficient resources are available.
+ *
+ * If there are P2PDMA pages in the scatterlist then this function may
+ * return -EREMOTEIO to indicate that the pages are not mappable by the
+ * device. In this case, an error should be returned for the IO as it
+ * will never be successfully retried.
  */
 int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
@@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
 		ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
 	else
 		ents = ops->map_sg(dev, sg, nents, dir, attrs);
-	BUG_ON(ents < 0);
+
 	debug_dma_map_sg(dev, sg, nents, ents, dir);
 
 	return ents;
-- 
2.20.1



* [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-13  2:36   ` Ira Weiny
  2021-03-16  8:15   ` Christoph Hellwig
  2021-03-11 23:31 ` [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/dma-map-ops.h | 3 +++
 include/linux/dma-mapping.h | 5 +++++
 kernel/dma/mapping.c        | 8 ++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..481892822104 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -12,6 +12,9 @@
 struct cma;
 
 struct dma_map_ops {
+	unsigned int flags;
+#define DMA_F_PCI_P2PDMA_SUPPORTED     (1 << 0)
+
 	void *(*alloc)(struct device *dev, size_t size,
 			dma_addr_t *dma_handle, gfp_t gfp,
 			unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..a6bced004de8 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -138,6 +138,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
 		unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+int dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -233,6 +234,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
 {
 	return 0;
 }
+static inline int dma_pci_p2pdma_supported(struct device *dev)
+{
+	return 0;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
 	return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index adc1a83950be..cc1b97deb74d 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -540,6 +540,14 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+int dma_pci_p2pdma_supported(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.20.1



* [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-12 15:52   ` Robin Murphy
  2021-03-11 23:31 ` [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA Logan Gunthorpe
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

Similar to dma-direct, sg_mark_pci_p2pdma() is used to flag
bus-address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is
set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
P2PDMA pages.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/iommu/dma-iommu.c | 63 ++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..c0821e9051a9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/swiotlb.h>
 #include <linux/scatterlist.h>
 #include <linux/vmalloc.h>
@@ -846,7 +847,7 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
  * segment's start address to avoid concatenating across one.
  */
 static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
-		dma_addr_t dma_addr)
+		dma_addr_t dma_addr, unsigned long attrs)
 {
 	struct scatterlist *s, *cur = sg;
 	unsigned long seg_mask = dma_get_seg_boundary(dev);
@@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		sg_dma_address(s) = DMA_MAPPING_ERROR;
 		sg_dma_len(s) = 0;
 
+		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+			if (i > 0)
+				cur = sg_next(cur);
+
+			sg_dma_address(cur) = sg_phys(s) + s->offset -
+				pci_p2pdma_bus_offset(sg_page(s));
+			sg_dma_len(cur) = s->length;
+			sg_mark_pci_p2pdma(cur);
+
+			count++;
+			cur_len = 0;
+			continue;
+		}
+
 		/*
 		 * Now fill in the real DMA data. If...
 		 * - there is a valid output segment to append to
@@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
 	struct scatterlist *s, *prev = NULL;
+	struct dev_pagemap *pgmap = NULL;
 	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
-	int i;
+	int i, map = -1, ret = 0;
 
 	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
 	    iommu_deferred_attach(dev, domain))
@@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		s_length = iova_align(iovad, s_length + s_iova_off);
 		s->length = s_length;
 
+		if (is_pci_p2pdma_page(sg_page(s))) {
+			if (sg_page(s)->pgmap != pgmap) {
+				pgmap = sg_page(s)->pgmap;
+				map = pci_p2pdma_dma_map_type(dev, pgmap);
+			}
+
+			if (map < 0) {
+				ret = -EREMOTEIO;
+				goto out_restore_sg;
+			}
+
+			if (map) {
+				s->length = 0;
+				continue;
+			}
+		}
+
 		/*
 		 * Due to the alignment of our single IOVA allocation, we can
 		 * depend on these assumptions about the segment boundary mask:
@@ -1015,6 +1048,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		prev = s;
 	}
 
+	if (!iova_len)
+		return __finalise_sg(dev, sg, nents, 0, attrs);
+
 	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_restore_sg;
@@ -1026,19 +1062,19 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len)
 		goto out_free_iova;
 
-	return __finalise_sg(dev, sg, nents, iova);
+	return __finalise_sg(dev, sg, nents, iova, attrs);
 
 out_free_iova:
 	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
 	__invalidate_sg(sg, nents);
-	return 0;
+	return ret;
 }
 
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-	dma_addr_t start, end;
+	dma_addr_t end, start = DMA_MAPPING_ERROR;
 	struct scatterlist *tmp;
 	int i;
 
@@ -1054,14 +1090,20 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	 * The scatterlist segments are mapped into a single
 	 * contiguous IOVA allocation, so this is incredibly easy.
 	 */
-	start = sg_dma_address(sg);
-	for_each_sg(sg_next(sg), tmp, nents - 1, i) {
+	for_each_sg(sg, tmp, nents, i) {
+		if (sg_is_pci_p2pdma(tmp))
+			continue;
 		if (sg_dma_len(tmp) == 0)
 			break;
-		sg = tmp;
+
+		if (start == DMA_MAPPING_ERROR)
+			start = sg_dma_address(tmp);
+
+		end = sg_dma_address(tmp) + sg_dma_len(tmp);
 	}
-	end = sg_dma_address(sg) + sg_dma_len(sg);
-	__iommu_dma_unmap(dev, start, end - start);
+
+	if (start != DMA_MAPPING_ERROR)
+		__iommu_dma_unmap(dev, start, end - start);
 }
 
 static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
@@ -1254,6 +1296,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 }
 
 static const struct dma_map_ops iommu_dma_ops = {
+	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED,
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
 	.alloc_pages		= dma_common_alloc_pages,
-- 
2.20.1



* [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-16  8:00   ` Christoph Hellwig
  2021-03-11 23:31 ` [RFC PATCH v2 10/11] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Create a specific error code for when P2PDMA pages are passed to a block
device that cannot map them (due to no IOMMU support or ACS protections).

This makes request errors in these cases more informative as to what
caused the error.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-core.c          | 2 ++
 include/linux/blk_types.h | 7 +++++++
 2 files changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..2cc75b56ac43 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -192,6 +192,8 @@ static const struct {
 	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
 	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
 
+	[BLK_STS_P2PDMA] = { -EREMOTEIO, "P2PDMA to invalid device" },
+
 	/* everything else not covered above: */
 	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
 };
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..728a0898cb34 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -142,6 +142,13 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_ZONE_ACTIVE_RESOURCE	((__force blk_status_t)16)
 
+/*
+ * BLK_STS_P2PDMA is returned from the driver if P2PDMA memory fails to be
+ * mapped to the target device. This is a permanent error and the request
+ * should not be retried.
+ */
+#define BLK_STS_P2PDMA		((__force blk_status_t)17)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
-- 
2.20.1



* [RFC PATCH v2 10/11] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-11 23:31 ` [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
  2021-03-12 15:51 ` [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Robin Murphy
  11 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +++++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e68a8c4ac5a6..501b4a979776 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3928,7 +3928,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+	if (ctrl->ops->supports_pci_p2pdma &&
+	    ctrl->ops->supports_pci_p2pdma(ctrl))
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
 	ns->queue->queuedata = ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 07b34175c6ce..9c04df982d2c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -473,7 +473,6 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
-#define NVME_F_PCI_P2PDMA		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -481,6 +480,7 @@ struct nvme_ctrl_ops {
 	void (*submit_async_event)(struct nvme_ctrl *ctrl);
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17ab3320d28b..7d40c6a9e58e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2759,17 +2759,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+	return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+	.flags			= NVME_F_METADATA_SUPPORTED,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
 	.free_ctrl		= nvme_pci_free_ctrl,
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
+	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.20.1



* [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 10/11] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
@ 2021-03-11 23:31 ` Logan Gunthorpe
  2021-03-11 23:59   ` Jason Gunthorpe
  2021-03-12 15:51 ` [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Robin Murphy
  11 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-11 23:31 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Logan Gunthorpe

Convert to using dma_[un]map_sg() for PCI p2pdma pages.

This should be equivalent, though support will be somewhat less
(only dma-direct and dma-iommu are currently supported).

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/pci.c | 27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7d40c6a9e58e..89ca5acf7a62 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -577,17 +577,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
 
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-				    rq_dma_dir(req));
-	else
-		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -600,7 +589,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	WARN_ON_ONCE(!iod->nents);
 
-	nvme_unmap_sg(dev, req);
+	dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
 	if (iod->npages == 0)
 		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
 			      iod->first_dma);
@@ -868,13 +857,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (!iod->nents)
 		goto out_free_sg;
 
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
-				iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	else
-		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
-					     rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	if (!nr_mapped)
+	nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+				     rq_dma_dir(req), DMA_ATTR_NO_WARN);
+	if (nr_mapped < 0)
+		ret = BLK_STS_P2PDMA;
+	if (nr_mapped <= 0)
 		goto out_free_sg;
 
 	iod->use_sgl = nvme_pci_use_sgls(dev, req);
@@ -887,7 +874,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return BLK_STS_OK;
 
 out_unmap_sg:
-	nvme_unmap_sg(dev, req);
+	dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
 out_free_sg:
 	mempool_free(iod->sg, dev->iod_mempool);
 	return ret;
-- 
2.20.1



* Re: [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages
  2021-03-11 23:31 ` [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
@ 2021-03-11 23:59   ` Jason Gunthorpe
  2021-03-12  1:37     ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Gunthorpe @ 2021-03-11 23:59 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:41PM -0700, Logan Gunthorpe wrote:
> Convert to using dma_[un]map_sg() for PCI p2pdma pages.
> 
> This should be equivalent, though support will be somewhat less
> (only dma-direct and dma-iommu are currently supported).
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  drivers/nvme/host/pci.c | 27 +++++++--------------------
>  1 file changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7d40c6a9e58e..89ca5acf7a62 100644
> +++ b/drivers/nvme/host/pci.c
> @@ -577,17 +577,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
>  
>  }
>  
> -static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
> -{
> -	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> -
> -	if (is_pci_p2pdma_page(sg_page(iod->sg)))
> -		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
> -				    rq_dma_dir(req));
> -	else
> -		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
> -}

Can the two other places with this code pattern be changed too?

Jason


* Re: [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages
  2021-03-11 23:59   ` Jason Gunthorpe
@ 2021-03-12  1:37     ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12  1:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-11 4:59 p.m., Jason Gunthorpe wrote:
> On Thu, Mar 11, 2021 at 04:31:41PM -0700, Logan Gunthorpe wrote:
>> Convert to using dma_[un]map_sg() for PCI p2pdma pages.
>>
>> This should be equivalent, though support will be somewhat less
>> (only dma-direct and dma-iommu are currently supported).
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>  drivers/nvme/host/pci.c | 27 +++++++--------------------
>>  1 file changed, 7 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 7d40c6a9e58e..89ca5acf7a62 100644
>> +++ b/drivers/nvme/host/pci.c
>> @@ -577,17 +577,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
>>  
>>  }
>>  
>> -static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
>> -{
>> -	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>> -
>> -	if (is_pci_p2pdma_page(sg_page(iod->sg)))
>> -		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
>> -				    rq_dma_dir(req));
>> -	else
>> -		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
>> -}
> 
> Can the two other places with this code pattern be changed too?

Yes, if this goes forward, I imagine completely dropping
pci_p2pdma_unmap_sg().

Logan


* Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA
  2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
                   ` (10 preceding siblings ...)
  2021-03-11 23:31 ` [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
@ 2021-03-12 15:51 ` Robin Murphy
  2021-03-12 16:18   ` Logan Gunthorpe
  11 siblings, 1 reply; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 15:51 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On 2021-03-11 23:31, Logan Gunthorpe wrote:
> Hi,
> 
> This is a rework of the first half of my RFC for doing P2PDMA in userspace
> with O_DIRECT[1].
> 
> The largest issue with that series was the gross way of flagging P2PDMA
> SGL segments. This RFC proposes a different approach, (suggested by
> Dan Williams[2]) which uses the third bit in the page_link field of the
> SGL.
> 
> This approach is a lot less hacky but comes at the cost of adding a
> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
> scarce bit in the page_link. For our purposes, a 64BIT restriction is
> acceptable but it's not clear if this is ok for all usecases hoping
> to make use of P2PDMA.
> 
> Matthew Wilcox has already suggested (off-list) that this is the wrong
> approach, preferring a new dma mapping operation and an SGL replacement. I
> don't disagree that something along those lines would be a better long
> term solution, but it involves overcoming a lot of challenges to get
> there. Creating a new mapping operation still means adding support to more
> than 25 dma_map_ops implementations (many of which are on obscure
> architectures) or creating a redundant path to fallback with dma_map_sg()
> for every driver that uses the new operation. This RFC is an approach
> that doesn't require overcoming these blocks.

I don't really follow that argument - you're only adding support to two 
implementations with the awkward flag, so why would using a dedicated 
operation instead be any different? Whatever callers need to do if 
dma_pci_p2pdma_supported() says no, they could equally do if 
dma_map_p2p_sg() (or whatever) returns -ENXIO, no?

We don't try to multiplex .map_resource through .map_page, so there 
doesn't seem to be any good reason to force that complexity on .map_sg 
either. And having a distinct API from the outset should make it a lot 
easier to transition to better "list of P2P memory regions" data 
structures in future without rewriting the whole world. As it is, there 
are potential benefits in a P2P interface which can define its own 
behaviour - for instance if it threw out the notion of segment merging it 
could save a load of bother by just maintaining the direct correlation 
between pages and DMA addresses.

Robin.

> Any alternative ideas or feedback is welcome.
> 
> These patches are based on v5.12-rc2 and a git branch is available here:
> 
>    https://github.com/sbates130272/linux-p2pmem/  p2pdma_dma_map_ops_rfc
> 
> A branch with the patches from the previous RFC that add userspace
> O_DIRECT support is available at the same URL with the name
> "p2pdma_dma_map_ops_rfc+user" (however, none of the issues with those
> extra patches from the feedback of the last posting have been fixed).
> 
> Thanks,
> 
> Logan
> 
> [1] https://lore.kernel.org/linux-block/20201106170036.18713-1-logang@deltatee.com/
> [2] https://lore.kernel.org/linux-block/CAPcyv4ifGcrdOtUt8qr7pmFhmecGHqGVre9G0RorGczCGVECQQ@mail.gmail.com/
> 
> --
> 
> Logan Gunthorpe (11):
>    PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
>    PCI/P2PDMA: Avoid pci_get_slot() which sleeps
>    PCI/P2PDMA: Attempt to set map_type if it has not been set
>    PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
>      pci_p2pdma_bus_offset()
>    lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
>    dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
>    dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
>    iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
>    block: Add BLK_STS_P2PDMA
>    nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
>    nvme-pci: Convert to using dma_map_sg for p2pdma pages
> 
>   block/blk-core.c            |  2 +
>   drivers/iommu/dma-iommu.c   | 63 +++++++++++++++++++++-----
>   drivers/nvme/host/core.c    |  3 +-
>   drivers/nvme/host/nvme.h    |  2 +-
>   drivers/nvme/host/pci.c     | 38 +++++++---------
>   drivers/pci/Kconfig         |  2 +-
>   drivers/pci/p2pdma.c        | 89 +++++++++++++++++++++++++++++++------
>   include/linux/blk_types.h   |  7 +++
>   include/linux/dma-map-ops.h |  3 ++
>   include/linux/dma-mapping.h |  5 +++
>   include/linux/pci-p2pdma.h  | 11 +++++
>   include/linux/scatterlist.h | 49 ++++++++++++++++++--
>   kernel/dma/direct.c         | 35 +++++++++++++--
>   kernel/dma/mapping.c        | 21 +++++++--
>   14 files changed, 271 insertions(+), 59 deletions(-)
> 
> 
> base-commit: a38fd8748464831584a19438cbb3082b5a2dab15
> --
> 2.20.1
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> 


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-11 23:31 ` [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
@ 2021-03-12 15:52   ` Robin Murphy
  2021-03-12 16:24     ` Logan Gunthorpe
  2021-03-16  8:11   ` Christoph Hellwig
  1 sibling, 1 reply; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 15:52 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On 2021-03-11 23:31, Logan Gunthorpe wrote:
> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> PCI P2PDMA pages directly without a hack in the callers. This allows
> for heterogeneous SGLs that contain both P2PDMA and regular pages.
> 
> SGL segments that contain PCI bus addresses are marked with
> sg_mark_pci_p2pdma() and are ignored when unmapped.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   kernel/dma/direct.c  | 35 ++++++++++++++++++++++++++++++++---
>   kernel/dma/mapping.c | 13 ++++++++++---
>   2 files changed, 42 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 002268262c9a..f326d32062dd 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -13,6 +13,7 @@
>   #include <linux/vmalloc.h>
>   #include <linux/set_memory.h>
>   #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
>   #include "direct.h"
>   
>   /*
> @@ -387,19 +388,47 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>   	struct scatterlist *sg;
>   	int i;
>   
> -	for_each_sg(sgl, sg, nents, i)
> +	for_each_sg(sgl, sg, nents, i) {
> +		if (sg_is_pci_p2pdma(sg))
> +			continue;
> +
>   		dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
>   			     attrs);
> +	}
>   }
>   #endif
>   
>   int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>   		enum dma_data_direction dir, unsigned long attrs)
>   {
> -	int i;
> +	struct dev_pagemap *pgmap = NULL;
> +	int i, map = -1, ret = 0;
>   	struct scatterlist *sg;
> +	u64 bus_off;
>   
>   	for_each_sg(sgl, sg, nents, i) {
> +		if (is_pci_p2pdma_page(sg_page(sg))) {
> +			if (sg_page(sg)->pgmap != pgmap) {
> +				pgmap = sg_page(sg)->pgmap;
> +				map = pci_p2pdma_dma_map_type(dev, pgmap);
> +				bus_off = pci_p2pdma_bus_offset(sg_page(sg));
> +			}
> +
> +			if (map < 0) {
> +				sg->dma_address = DMA_MAPPING_ERROR;
> +				ret = -EREMOTEIO;
> +				goto out_unmap;
> +			}
> +
> +			if (map) {
> +				sg->dma_address = sg_phys(sg) + sg->offset -
> +					bus_off;
> +				sg_dma_len(sg) = sg->length;
> +				sg_mark_pci_p2pdma(sg);
> +				continue;
> +			}
> +		}
> +
>   		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>   				sg->offset, sg->length, dir, attrs);
>   		if (sg->dma_address == DMA_MAPPING_ERROR)
> @@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>   
>   out_unmap:
>   	dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
> -	return 0;
> +	return ret;
>   }
>   
>   dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index b6a633679933..adc1a83950be 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
>   EXPORT_SYMBOL(dma_unmap_page_attrs);
>   
>   /*
> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
> - * It should never return a value < 0.
> + * dma_maps_sg_attrs returns 0 on any resource error and > 0 on success.
> + *
> + * If 0 is returned, the mapping can be retried and will succeed once
> + * sufficient resources are available.

That's not a guarantee we can uphold. Retrying forever in the vain hope 
that a device might evolve some extra address bits, or a bounce buffer 
might magically grow big enough for a gigantic mapping, isn't 
necessarily the best idea.

> + *
> + * If there are P2PDMA pages in the scatterlist then this function may
> + * return -EREMOTEIO to indicate that the pages are not mappable by the
> + * device. In this case, an error should be returned for the IO as it
> + * will never be successfully retried.
>    */
>   int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>   		enum dma_data_direction dir, unsigned long attrs)
> @@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
>   		ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
>   	else
>   		ents = ops->map_sg(dev, sg, nents, dir, attrs);
> -	BUG_ON(ents < 0);
> +

This scares me - I hesitate to imagine the amount of driver/subsystem 
code out there that will see nonzero and merrily set off iterating a 
negative number of segments, if we open the floodgates of allowing 
implementations to return error codes here.

Robin.

>   	debug_dma_map_sg(dev, sg, nents, ents, dir);
>   
>   	return ents;
> 


* Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2021-03-11 23:31 ` [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
@ 2021-03-12 15:52   ` Robin Murphy
  2021-03-12 17:03     ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 15:52 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On 2021-03-11 23:31, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.

This misled me at first, but I see the implementation does actually 
appear to accommodate the case of working ACS where P2P *would* still 
need to be mapped at the IOMMU.

> Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
> indicate bus address segments. On unmap, P2PDMA segments are skipped
> over when determining the start and end IOVA addresses.
> 
> With this change, the flags variable in the dma_map_ops is
> set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
> P2PDMA pages.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/iommu/dma-iommu.c | 63 ++++++++++++++++++++++++++++++++-------
>   1 file changed, 53 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index af765c813cc8..c0821e9051a9 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -20,6 +20,7 @@
>   #include <linux/mm.h>
>   #include <linux/mutex.h>
>   #include <linux/pci.h>
> +#include <linux/pci-p2pdma.h>
>   #include <linux/swiotlb.h>
>   #include <linux/scatterlist.h>
>   #include <linux/vmalloc.h>
> @@ -846,7 +847,7 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
>    * segment's start address to avoid concatenating across one.
>    */
>   static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
> -		dma_addr_t dma_addr)
> +		dma_addr_t dma_addr, unsigned long attrs)
>   {
>   	struct scatterlist *s, *cur = sg;
>   	unsigned long seg_mask = dma_get_seg_boundary(dev);
> @@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
>   		sg_dma_address(s) = DMA_MAPPING_ERROR;
>   		sg_dma_len(s) = 0;
>   
> +		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
> +			if (i > 0)
> +				cur = sg_next(cur);
> +
> +			sg_dma_address(cur) = sg_phys(s) + s->offset -

Are you sure about that? ;)

> +				pci_p2pdma_bus_offset(sg_page(s));

Can the bus offset make P2P addresses overlap with regions of mem space 
that we might use for regular IOVA allocation? That would be very bad...

> +			sg_dma_len(cur) = s->length;
> +			sg_mark_pci_p2pdma(cur);
> +
> +			count++;
> +			cur_len = 0;
> +			continue;
> +		}
> +
>   		/*
>   		 * Now fill in the real DMA data. If...
>   		 * - there is a valid output segment to append to
> @@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>   	struct iova_domain *iovad = &cookie->iovad;
>   	struct scatterlist *s, *prev = NULL;
> +	struct dev_pagemap *pgmap = NULL;
>   	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>   	dma_addr_t iova;
>   	size_t iova_len = 0;
>   	unsigned long mask = dma_get_seg_boundary(dev);
> -	int i;
> +	int i, map = -1, ret = 0;
>   
>   	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
>   	    iommu_deferred_attach(dev, domain))
> @@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   		s_length = iova_align(iovad, s_length + s_iova_off);
>   		s->length = s_length;
>   
> +		if (is_pci_p2pdma_page(sg_page(s))) {
> +			if (sg_page(s)->pgmap != pgmap) {
> +				pgmap = sg_page(s)->pgmap;
> +				map = pci_p2pdma_dma_map_type(dev, pgmap);
> +			}
> +
> +			if (map < 0) {

It rather feels like it should be the job of whoever creates the list in 
the first place not to put unusable pages in it, especially since the 
p2pdma_map_type looks to be a fairly coarse-grained and static thing. 
The DMA API isn't responsible for validating normal memory pages, so 
what makes P2P special?

> +				ret = -EREMOTEIO;
> +				goto out_restore_sg;
> +			}
> +
> +			if (map) {
> +				s->length = 0;

I'm not really thrilled about the idea of passing zero-length segments 
to iommu_map_sg(). Yes, it happens to trick the concatenation logic in 
the current implementation into doing what you want, but it feels fragile.

> +				continue;
> +			}
> +		}
> +
>   		/*
>   		 * Due to the alignment of our single IOVA allocation, we can
>   		 * depend on these assumptions about the segment boundary mask:
> @@ -1015,6 +1048,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   		prev = s;
>   	}
>   
> +	if (!iova_len)
> +		return __finalise_sg(dev, sg, nents, 0, attrs);
> +
>   	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
>   	if (!iova)
>   		goto out_restore_sg;
> @@ -1026,19 +1062,19 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>   	if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len)
>   		goto out_free_iova;
>   
> -	return __finalise_sg(dev, sg, nents, iova);
> +	return __finalise_sg(dev, sg, nents, iova, attrs);
>   
>   out_free_iova:
>   	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
>   out_restore_sg:
>   	__invalidate_sg(sg, nents);
> -	return 0;
> +	return ret;
>   }
>   
>   static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>   		int nents, enum dma_data_direction dir, unsigned long attrs)
>   {
> -	dma_addr_t start, end;
> +	dma_addr_t end, start = DMA_MAPPING_ERROR;
>   	struct scatterlist *tmp;
>   	int i;
>   
> @@ -1054,14 +1090,20 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
>   	 * The scatterlist segments are mapped into a single
>   	 * contiguous IOVA allocation, so this is incredibly easy.
>   	 */
> -	start = sg_dma_address(sg);
> -	for_each_sg(sg_next(sg), tmp, nents - 1, i) {
> +	for_each_sg(sg, tmp, nents, i) {
> +		if (sg_is_pci_p2pdma(tmp))

Since the flag is associated with the DMA address which will no longer 
be valid, shouldn't it be cleared? The circumstances in which leaving it 
around could cause a problem are tenuous, but definitely possible.

Robin.

> +			continue;
>   		if (sg_dma_len(tmp) == 0)
>   			break;
> -		sg = tmp;
> +
> +		if (start == DMA_MAPPING_ERROR)
> +			start = sg_dma_address(tmp);
> +
> +		end = sg_dma_address(tmp) + sg_dma_len(tmp);
>   	}
> -	end = sg_dma_address(sg) + sg_dma_len(sg);
> -	__iommu_dma_unmap(dev, start, end - start);
> +
> +	if (start != DMA_MAPPING_ERROR)
> +		__iommu_dma_unmap(dev, start, end - start);
>   }
>   
>   static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
> @@ -1254,6 +1296,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
>   }
>   
>   static const struct dma_map_ops iommu_dma_ops = {
> +	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED,
>   	.alloc			= iommu_dma_alloc,
>   	.free			= iommu_dma_free,
>   	.alloc_pages		= dma_common_alloc_pages,
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA
  2021-03-12 15:51 ` [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Robin Murphy
@ 2021-03-12 16:18   ` Logan Gunthorpe
  2021-03-12 17:46     ` Robin Murphy
  0 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 16:18 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin



On 2021-03-12 8:51 a.m., Robin Murphy wrote:
> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>> Hi,
>>
>> This is a rework of the first half of my RFC for doing P2PDMA in
>> userspace
>> with O_DIRECT[1].
>>
>> The largest issue with that series was the gross way of flagging P2PDMA
>> SGL segments. This RFC proposes a different approach, (suggested by
>> Dan Williams[2]) which uses the third bit in the page_link field of the
>> SGL.
>>
>> This approach is a lot less hacky but comes at the cost of adding a
>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>> acceptable but it's not clear if this is ok for all usecases hoping
>> to make use of P2PDMA.
>>
>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>> approach, preferring a new dma mapping operation and an SGL
>> replacement. I
>> don't disagree that something along those lines would be a better long
>> term solution, but it involves overcoming a lot of challenges to get
>> there. Creating a new mapping operation still means adding support to
>> more
>> than 25 dma_map_ops implementations (many of which are on obscure
>> architectures) or creating a redundant path to fallback with dma_map_sg()
>> for every driver that uses the new operation. This RFC is an approach
>> that doesn't require overcoming these blocks.
> 
> I don't really follow that argument - you're only adding support to two
> implementations with the awkward flag, so why would using a dedicated
> operation instead be any different? Whatever callers need to do if
> dma_pci_p2pdma_supported() says no, they could equally do if
> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?

The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
transactions cannot be done, but regular transactions can still go
through as they always did.

But replacing dma_map_sg() with dma_map_new() affects all operations,
P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
not support P2PDMA; it has to maintain a fallback path to dma_map_sg().
Given that the inputs and outputs for dma_map_new() will be completely
different data structures this will be quite a lot of similar paths
required in the driver. (ie mapping a bvec to the input struct and the
output struct to hardware requirements) If a bug crops up in the old
dma_map_sg(), developers might not notice it for some time since it
won't be used on the most popular architectures.

Logan


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-12 15:52   ` Robin Murphy
@ 2021-03-12 16:24     ` Logan Gunthorpe
  2021-03-12 18:11       ` Robin Murphy
  0 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 16:24 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin



On 2021-03-12 8:52 a.m., Robin Murphy wrote:
>> +
>>           sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>>                   sg->offset, sg->length, dir, attrs);
>>           if (sg->dma_address == DMA_MAPPING_ERROR)
>> @@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct
>> scatterlist *sgl, int nents,
>>     out_unmap:
>>       dma_direct_unmap_sg(dev, sgl, i, dir, attrs |
>> DMA_ATTR_SKIP_CPU_SYNC);
>> -    return 0;
>> +    return ret;
>>   }
>>     dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t
>> paddr,
>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>> index b6a633679933..adc1a83950be 100644
>> --- a/kernel/dma/mapping.c
>> +++ b/kernel/dma/mapping.c
>> @@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev,
>> dma_addr_t addr, size_t size,
>>   EXPORT_SYMBOL(dma_unmap_page_attrs);
>>     /*
>> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>> - * It should never return a value < 0.
>> + * dma_maps_sg_attrs returns 0 on any resource error and > 0 on success.
>> + *
>> + * If 0 is returned, the mapping can be retried and will succeed once
>> + * sufficient resources are available.
> 
> That's not a guarantee we can uphold. Retrying forever in the vain hope
> that a device might evolve some extra address bits, or a bounce buffer
> might magically grow big enough for a gigantic mapping, isn't
> necessarily the best idea.

Perhaps this is just poorly worded. Returning 0 is the normal case and
nothing has changed there. The block layer, for example, will retry if
zero is returned as this only happens if it failed to allocate resources
for the mapping. The reason we have to return -1 is to tell the block
layer not to retry these requests as they will never succeed in the future.
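To make the convention concrete, here is a small user-space sketch (hypothetical names, plain C rather than the real block-layer code) of how a caller following the semantics described above would dispatch on the return value of dma_map_sg(): positive means submit, zero means a transient resource shortage worth retrying, and a negative errno such as -EREMOTEIO means the IO can never succeed and must be failed:

```c
#include <assert.h>
#include <errno.h>

enum io_action { IO_SUBMIT, IO_RETRY, IO_FAIL };

/*
 * Hypothetical helper modelling the convention discussed above:
 *   ents > 0  -> segments mapped, submit the IO
 *   ents == 0 -> transient resource failure, safe to retry later
 *   ents < 0  -> permanent failure (e.g. -EREMOTEIO for unreachable
 *                P2PDMA pages), fail the IO without retrying
 */
static enum io_action handle_map_result(int ents)
{
	if (ents > 0)
		return IO_SUBMIT;
	if (ents == 0)
		return IO_RETRY;
	return IO_FAIL;
}
```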

>> + *
>> + * If there are P2PDMA pages in the scatterlist then this function may
>> + * return -EREMOTEIO to indicate that the pages are not mappable by the
>> + * device. In this case, an error should be returned for the IO as it
>> + * will never be successfully retried.
>>    */
>>   int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int
>> nents,
>>           enum dma_data_direction dir, unsigned long attrs)
>> @@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct
>> scatterlist *sg, int nents,
>>           ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
>>       else
>>           ents = ops->map_sg(dev, sg, nents, dir, attrs);
>> -    BUG_ON(ents < 0);
>> +
> 
> This scares me - I hesitate to imagine the amount of driver/subsystem
> code out there that will see nonzero and merrily set off iterating a
> negative number of segments, if we open the floodgates of allowing
> implementations to return error codes here.

Yes, but it will never happen on existing drivers/subsystems. The only
way it can return a negative number is if the driver passes in P2PDMA
pages which can't happen without changes in the driver. We are careful
about where P2PDMA pages can get into so we don't have to worry about
all the existing driver code out there.

Logan


* Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2021-03-12 15:52   ` Robin Murphy
@ 2021-03-12 17:03     ` Logan Gunthorpe
  2021-03-12 19:47       ` Robin Murphy
  0 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 17:03 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin



On 2021-03-12 8:52 a.m., Robin Murphy wrote:
> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
> 
> This misled me at first, but I see the implementation does actually
> appear to accommodate the case of working ACS where P2P *would* still
> need to be mapped at the IOMMU.

Yes, that's correct.
>>   static int __finalise_sg(struct device *dev, struct scatterlist *sg,
>> int nents,
>> -        dma_addr_t dma_addr)
>> +        dma_addr_t dma_addr, unsigned long attrs)
>>   {
>>       struct scatterlist *s, *cur = sg;
>>       unsigned long seg_mask = dma_get_seg_boundary(dev);
>> @@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev,
>> struct scatterlist *sg, int nents,
>>           sg_dma_address(s) = DMA_MAPPING_ERROR;
>>           sg_dma_len(s) = 0;
>>   +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>> +            if (i > 0)
>> +                cur = sg_next(cur);
>> +
>> +            sg_dma_address(cur) = sg_phys(s) + s->offset -
> 
> Are you sure about that? ;)

Do you see a bug? I don't follow you...

>> +                pci_p2pdma_bus_offset(sg_page(s));
> 
> Can the bus offset make P2P addresses overlap with regions of mem space
> that we might use for regular IOVA allocation? That would be very bad...

No. IOMMU drivers already disallow all PCI addresses from being used as
IOVA addresses. See, for example, dmar_init_reserved_ranges(). It would
be a huge problem for a whole lot of other reasons if it didn't.
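The invariant being relied on here can be modelled in a few lines of user-space C (a toy sketch, not the real IOMMU code; the window values are made up for illustration): every PCI bus window is reserved up front, and the IOVA allocator refuses any range that overlaps one, so a P2PDMA bus address can never collide with an address the allocator hands out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint64_t start, end; };	/* inclusive bounds */

/* Hypothetical reserved PCI windows, in the spirit of
 * dmar_init_reserved_ranges(); real windows come from the host
 * bridge resources. */
static const struct range reserved_pci_windows[] = {
	{ 0xf0000000ULL, 0xf7ffffffULL },
};

/* An IOVA allocation is only legal if it overlaps no PCI window. */
static bool iova_is_allowed(uint64_t addr, uint64_t size)
{
	for (size_t i = 0; i < sizeof(reserved_pci_windows) /
				sizeof(reserved_pci_windows[0]); i++) {
		const struct range *r = &reserved_pci_windows[i];

		if (addr <= r->end && addr + size - 1 >= r->start)
			return false;	/* collides with a bus window */
	}
	return true;
}
```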


>> @@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>       struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>       struct iova_domain *iovad = &cookie->iovad;
>>       struct scatterlist *s, *prev = NULL;
>> +    struct dev_pagemap *pgmap = NULL;
>>       int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>>       dma_addr_t iova;
>>       size_t iova_len = 0;
>>       unsigned long mask = dma_get_seg_boundary(dev);
>> -    int i;
>> +    int i, map = -1, ret = 0;
>>         if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
>>           iommu_deferred_attach(dev, domain))
>> @@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>           s_length = iova_align(iovad, s_length + s_iova_off);
>>           s->length = s_length;
>>   +        if (is_pci_p2pdma_page(sg_page(s))) {
>> +            if (sg_page(s)->pgmap != pgmap) {
>> +                pgmap = sg_page(s)->pgmap;
>> +                map = pci_p2pdma_dma_map_type(dev, pgmap);
>> +            }
>> +
>> +            if (map < 0) {
> 
> It rather feels like it should be the job of whoever creates the list in
> the first place not to put unusable pages in it, especially since the
> p2pdma_map_type looks to be a fairly coarse-grained and static thing.
> The DMA API isn't responsible for validating normal memory pages, so
> what makes P2P special?

Yes, that would be ideal, but there are some difficulties there. For the
driver to check the pages, it would need to loop through the entire SG
one more time on every transaction, regardless of whether there are
P2PDMA pages, or not. So that will have a performance impact even when
the feature isn't being used. I don't think that'll be acceptable for
many drivers.

The other possibility is for GUP to do it when it gets the pages from
userspace. But GUP doesn't have all the information to do this at the
moment. We'd have to pass the struct device that will eventually map the
pages through all the nested functions in the GUP to do that test at
that time. This might not be a bad option (that I half looked into), but
I'm not sure how acceptable it would be to the GUP developers.

But even if we do verify the pages ahead of time, we still need the same
infrastructure in dma_map_sg(); the only difference is that invalid
pages reaching it would then be a BUG rather than an error return.

>> +                ret = -EREMOTEIO;
>> +                goto out_restore_sg;
>> +            }
>> +
>> +            if (map) {
>> +                s->length = 0;
> 
> I'm not really thrilled about the idea of passing zero-length segments
> to iommu_map_sg(). Yes, it happens to trick the concatenation logic in
> the current implementation into doing what you want, but it feels fragile.

We're not passing zero length segments to iommu_map_sg() (or any
function). This loop is just scanning to calculate the length of the
required IOVA. __finalise_sg() (which is intimately tied to this loop)
then needs a way to determine which segments were P2P segments. The
existing code already overwrites s->length with an aligned length and
stores the original length in sg_dma_len. So we're not relying on
tricking any logic here.


>>   }
>>     static void iommu_dma_unmap_sg(struct device *dev, struct
>> scatterlist *sg,
>>           int nents, enum dma_data_direction dir, unsigned long attrs)
>>   {
>> -    dma_addr_t start, end;
>> +    dma_addr_t end, start = DMA_MAPPING_ERROR;
>>       struct scatterlist *tmp;
>>       int i;
>>   @@ -1054,14 +1090,20 @@ static void iommu_dma_unmap_sg(struct device
>> *dev, struct scatterlist *sg,
>>        * The scatterlist segments are mapped into a single
>>        * contiguous IOVA allocation, so this is incredibly easy.
>>        */
>> -    start = sg_dma_address(sg);
>> -    for_each_sg(sg_next(sg), tmp, nents - 1, i) {
>> +    for_each_sg(sg, tmp, nents, i) {
>> +        if (sg_is_pci_p2pdma(tmp))
> 
> Since the flag is associated with the DMA address which will no longer
> be valid, shouldn't it be cleared? The circumstances in which leaving it
> around could cause a problem are tenuous, but definitely possible.

Yes, that's a good idea.
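For reference, a user-space model of what the amended unmap scan would look like with the flag cleared as suggested (a toy struct stands in for the scatterlist and the sg_* helpers; nothing here is the real kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a mapped scatterlist segment. */
struct seg {
	uint64_t dma_addr;
	uint32_t dma_len;
	bool p2pdma;		/* stand-in for sg_is_pci_p2pdma() */
};

#define MAPPING_ERROR ((uint64_t)-1)

/* Compute the contiguous IOVA span to unmap, skipping bus-address
 * (P2PDMA) segments and clearing their flag since it dies with the
 * mapping. Returns the span length, or 0 if nothing was IOVA-mapped. */
static uint64_t iova_span(struct seg *segs, int nents, uint64_t *startp)
{
	uint64_t start = MAPPING_ERROR, end = 0;

	for (int i = 0; i < nents; i++) {
		struct seg *s = &segs[i];

		if (s->p2pdma) {
			s->p2pdma = false;
			continue;
		}
		if (s->dma_len == 0)
			break;
		if (start == MAPPING_ERROR)
			start = s->dma_addr;
		end = s->dma_addr + s->dma_len;
	}
	if (start == MAPPING_ERROR)
		return 0;
	*startp = start;
	return end - start;
}
```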

Thanks for the review!

Logan


* Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA
  2021-03-12 16:18   ` Logan Gunthorpe
@ 2021-03-12 17:46     ` Robin Murphy
  2021-03-12 18:24       ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 17:46 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On 2021-03-12 16:18, Logan Gunthorpe wrote:
> 
> 
> On 2021-03-12 8:51 a.m., Robin Murphy wrote:
>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>> Hi,
>>>
>>> This is a rework of the first half of my RFC for doing P2PDMA in
>>> userspace
>>> with O_DIRECT[1].
>>>
>>> The largest issue with that series was the gross way of flagging P2PDMA
>>> SGL segments. This RFC proposes a different approach, (suggested by
>>> Dan Williams[2]) which uses the third bit in the page_link field of the
>>> SGL.
>>>
>>> This approach is a lot less hacky but comes at the cost of adding a
>>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>>> acceptable but it's not clear if this is ok for all usecases hoping
>>> to make use of P2PDMA.
>>>
>>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>>> approach, preferring a new dma mapping operation and an SGL
>>> replacement. I
>>> don't disagree that something along those lines would be a better long
>>> term solution, but it involves overcoming a lot of challenges to get
>>> there. Creating a new mapping operation still means adding support to
>>> more
>>> than 25 dma_map_ops implementations (many of which are on obscure
>>> architectures) or creating a redundant path to fallback with dma_map_sg()
>>> for every driver that uses the new operation. This RFC is an approach
>>> that doesn't require overcoming these blocks.
>>
>> I don't really follow that argument - you're only adding support to two
>> implementations with the awkward flag, so why would using a dedicated
>> operation instead be any different? Whatever callers need to do if
>> dma_pci_p2pdma_supported() says no, they could equally do if
>> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?
> 
> The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
> transactions cannot be done, but regular transactions can still go
> through as they always did.
> 
> But replacing dma_map_sg() with dma_map_new() affects all operations,
> P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
> not support P2PDMA; it has to maintain a fallback path to dma_map_sg().

But AFAICS the equivalent fallback path still has to exist either way. 
My impression so far is that callers would end up looking something like 
this:

	if (dma_pci_p2pdma_supported()) {
		if (dma_map_sg(...) < 0)
			//do non-p2p fallback due to p2p failure
	} else {
		//do non-p2p fallback due to lack of support
	}

at which point, simply:

	if (dma_map_sg_p2p(...) < 0)
		//do non-p2p fallback either way

seems entirely reasonable. What am I missing?

Let's not pretend that overloading an existing API means we can start 
feeding P2P pages into any old subsystem/driver without further changes 
- there already *are* at least some that retry ad infinitum if DMA 
mapping fails (the USB layer springs to mind...) and thus wouldn't 
handle the PCI_P2PDMA_MAP_NOT_SUPPORTED case acceptably.

> Given that the inputs and outputs for dma_map_new() will be completely
> different data structures this will be quite a lot of similar paths
> required in the driver. (ie mapping a bvec to the input struct and the
> output struct to hardware requirements) If a bug crops up in the old
> dma_map_sg(), developers might not notice it for some time since it
> won't be used on the most popular architectures.

Huh? I'm specifically suggesting a new interface that takes the *same* 
data structure (at least to begin with), but just gives us more 
flexibility in terms of introducing p2p-aware behaviour somewhat more 
safely. Yes, we already know that we ultimately want something better 
than scatterlists for representing things like this and dma-buf imports, 
but that hardly has to happen overnight.

Robin.


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-12 16:24     ` Logan Gunthorpe
@ 2021-03-12 18:11       ` Robin Murphy
  2021-03-12 18:27         ` Logan Gunthorpe
  2021-03-16  7:56         ` Christoph Hellwig
  0 siblings, 2 replies; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 18:11 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Matthew Wilcox,
	Christian König, Jason Gunthorpe, Jason Ekstrand,
	Daniel Vetter, Dan Williams, Stephen Bates, Jakowski Andrzej,
	Christoph Hellwig, Xiong Jianxin

On 2021-03-12 16:24, Logan Gunthorpe wrote:
> 
> 
> On 2021-03-12 8:52 a.m., Robin Murphy wrote:
>>> +
>>>            sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>>>                    sg->offset, sg->length, dir, attrs);
>>>            if (sg->dma_address == DMA_MAPPING_ERROR)
>>> @@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct
>>> scatterlist *sgl, int nents,
>>>      out_unmap:
>>>        dma_direct_unmap_sg(dev, sgl, i, dir, attrs |
>>> DMA_ATTR_SKIP_CPU_SYNC);
>>> -    return 0;
>>> +    return ret;
>>>    }
>>>      dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t
>>> paddr,
>>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>>> index b6a633679933..adc1a83950be 100644
>>> --- a/kernel/dma/mapping.c
>>> +++ b/kernel/dma/mapping.c
>>> @@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev,
>>> dma_addr_t addr, size_t size,
>>>    EXPORT_SYMBOL(dma_unmap_page_attrs);
>>>      /*
>>> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>>> - * It should never return a value < 0.
>>> + * dma_maps_sg_attrs returns 0 on any resource error and > 0 on success.
>>> + *
>>> + * If 0 is returned, the mapping can be retried and will succeed once
>>> + * sufficient resources are available.
>>
>> That's not a guarantee we can uphold. Retrying forever in the vain hope
>> that a device might evolve some extra address bits, or a bounce buffer
>> might magically grow big enough for a gigantic mapping, isn't
>> necessarily the best idea.
> 
> Perhaps this is just poorly worded. Returning 0 is the normal case and
> nothing has changed there. The block layer, for example, will retry if
> zero is returned as this only happens if it failed to allocate resources
> for the mapping. The reason we have to return -1 is to tell the block
> layer not to retry these requests as they will never succeed in the future.
> 
>>> + *
>>> + * If there are P2PDMA pages in the scatterlist then this function may
>>> + * return -EREMOTEIO to indicate that the pages are not mappable by the
>>> + * device. In this case, an error should be returned for the IO as it
>>> + * will never be successfully retried.
>>>     */
>>>    int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int
>>> nents,
>>>            enum dma_data_direction dir, unsigned long attrs)
>>> @@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct
>>> scatterlist *sg, int nents,
>>>            ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
>>>        else
>>>            ents = ops->map_sg(dev, sg, nents, dir, attrs);
>>> -    BUG_ON(ents < 0);
>>> +
>>
>> This scares me - I hesitate to imagine the amount of driver/subsystem
>> code out there that will see nonzero and merrily set off iterating a
>> negative number of segments, if we open the floodgates of allowing
>> implementations to return error codes here.
> 
> Yes, but it will never happen on existing drivers/subsystems. The only
> way it can return a negative number is if the driver passes in P2PDMA
> pages which can't happen without changes in the driver. We are careful
> about where P2PDMA pages can get into so we don't have to worry about
> all the existing driver code out there.

Sure, that's how things stand immediately after this patch. But then 
someone comes along with the perfectly reasonable argument for returning 
more expressive error information for regular mapping failures as well 
(because sometimes those can be terminal too, as above), we start to get 
divergent behaviour across architectures and random bits of old code 
subtly breaking down the line. *That* is what makes me wary of making a 
fundamental change to a long-standing "nonzero means success" interface...

Robin.


* Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA
  2021-03-12 17:46     ` Robin Murphy
@ 2021-03-12 18:24       ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 18:24 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin



On 2021-03-12 10:46 a.m., Robin Murphy wrote:
> On 2021-03-12 16:18, Logan Gunthorpe wrote:
>>
>>
>> On 2021-03-12 8:51 a.m., Robin Murphy wrote:
>>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>>> Hi,
>>>>
>>>> This is a rework of the first half of my RFC for doing P2PDMA in
>>>> userspace
>>>> with O_DIRECT[1].
>>>>
>>>> The largest issue with that series was the gross way of flagging P2PDMA
>>>> SGL segments. This RFC proposes a different approach, (suggested by
>>>> Dan Williams[2]) which uses the third bit in the page_link field of the
>>>> SGL.
>>>>
>>>> This approach is a lot less hacky but comes at the cost of adding a
>>>> CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
>>>> scarce bit in the page_link. For our purposes, a 64BIT restriction is
>>>> acceptable but it's not clear if this is ok for all usecases hoping
>>>> to make use of P2PDMA.
>>>>
>>>> Matthew Wilcox has already suggested (off-list) that this is the wrong
>>>> approach, preferring a new dma mapping operation and an SGL
>>>> replacement. I
>>>> don't disagree that something along those lines would be a better long
>>>> term solution, but it involves overcoming a lot of challenges to get
>>>> there. Creating a new mapping operation still means adding support to
>>>> more
>>>> than 25 dma_map_ops implementations (many of which are on obscure
>>>> architectures) or creating a redundant path to fallback with
>>>> dma_map_sg()
>>>> for every driver that uses the new operation. This RFC is an approach
>>>> that doesn't require overcoming these blocks.
>>>
>>> I don't really follow that argument - you're only adding support to two
>>> implementations with the awkward flag, so why would using a dedicated
>>> operation instead be any different? Whatever callers need to do if
>>> dma_pci_p2pdma_supported() says no, they could equally do if
>>> dma_map_p2p_sg() (or whatever) returns -ENXIO, no?
>>
>> The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
>> transactions cannot be done, but regular transactions can still go
>> through as they always did.
>>
>> But replacing dma_map_sg() with dma_map_new() affects all operations,
>> P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
>> not support P2PDMA; it has to maintain a fallback path to dma_map_sg().
> 
> But AFAICS the equivalent fallback path still has to exist either way.
> My impression so far is that callers would end up looking something like
> this:
> 
>     if (dma_pci_p2pdma_supported()) {
>         if (dma_map_sg(...) < 0)
>             //do non-p2p fallback due to p2p failure
>     } else {
>         //do non-p2p fallback due to lack of support
>     }
> 
> at which point, simply:
> 
>     if (dma_map_sg_p2p(...) < 0)
>         //do non-p2p fallback either way
> 
> seems entirely reasonable. What am I missing?

No, that's not how it works. The dma_pci_p2pdma_supported() flag gates
whether P2PDMA pages will be used at a much higher level. Currently with
NVMe, if P2PDMA is supported, it sets the QUEUE_FLAG_PCI_P2PDMA on the
block queue. This is then tested with blk_queue_pci_p2pdma() before any
P2PDMA transaction with the block device is started.

In the following patches that could add userspace support,
blk_queue_pci_p2pdma() is used to add a flag to GUP which allows P2PDMA
pages into the driver via O_DIRECT.

There is no fallback path on the dma_map_sg() operation if p2pdma is not
supported. dma_map_sg() is always used. The code simply prevents pages
from getting there in the first place.
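The gating described above can be sketched in a few lines (illustrative names only, not the real blk_queue_pci_p2pdma()/QUEUE_FLAG_PCI_P2PDMA definitions): the queue advertises P2PDMA support once, and submitters check that flag before ever building a request containing P2PDMA pages, so dma_map_sg() never sees such pages on an unprepared path:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the high-level gating: the flag is set once when the
 * driver declares support, and checked before any P2PDMA submission. */
#define QUEUE_FLAG_P2PDMA (1u << 0)

struct queue { unsigned int flags; };

static bool queue_supports_p2pdma(const struct queue *q)
{
	return q->flags & QUEUE_FLAG_P2PDMA;
}

/* A request carrying P2PDMA pages may only go to a queue that opted
 * in; ordinary requests are unaffected either way. */
static bool may_submit(const struct queue *q, bool has_p2p_pages)
{
	return !has_p2p_pages || queue_supports_p2pdma(q);
}
```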

A new DMA operation would have to be:

if (dma_has_new_operation()) {
     // create input for dma_map_new_op()
     // map with dma_map_new_op()
     // parse output of dma_map_new_op()
} else {
     // create SGL
     dma_map_sg()
     // convert SGL to hardware map
}

And this if statement has nothing to do with whether p2pdma is supported or not.

> 
> Let's not pretend that overloading an existing API means we can start
> feeding P2P pages into any old subsystem/driver without further changes
> - there already *are* at least some that retry ad infinitum if DMA
> mapping fails (the USB layer springs to mind...) and thus wouldn't
> handle the PCI_P2PDMA_MAP_NOT_SUPPORTED case acceptably.

Yes, nobody is pretending that at all. We are being very careful with
P2PDMA pages and we don't want them to get into any driver that isn't
explicitly written to expect them. With the code in the current kernel
this is all very explicit and done ahead of time (with
QUEUE_FLAG_PCI_P2PDMA and similar). Once the pages get into userspace,
GUP will reject them unless the driver getting the pages specifically
sets a flag indicating support for them.

>> Given that the inputs and outputs for dma_map_new() will be completely
>> different data structures this will be quite a lot of similar paths
>> required in the driver. (ie mapping a bvec to the input struct and the
>> output struct to hardware requirements) If a bug crops up in the old
>> dma_map_sg(), developers might not notice it for some time seeing it
>> won't be used on the most popular architectures.
> 
> Huh? I'm specifically suggesting a new interface that takes the *same*
> data structure (at least to begin with), but just gives us more
> flexibility in terms of introducing p2p-aware behaviour somewhat more
> safely. Yes, we already know that we ultimately want something better
> than scatterlists for representing things like this and dma-buf imports,
> but that hardly has to happen overnight.

Ok, well that's not what Matthew had suggested off list. He specifically
was suggesting new data structures. And yes, that isn't going to happen
overnight.

If we have a dma_map_sg() and a dma_map_sg_p2p() that are identical
except that dma_map_sg_p2p() supports p2pdma memory and can return -1,
then that doesn't sound so bad to me. dma_map_sg() would then simply
call dma_map_sg_p2p() and BUG() on a negative return code. Though I
really don't see the benefit of the extra annotation; I don't think it
adds much value. The tricky part of the API is the SGL, which needs to
flag segments for P2PDMA, and the new function doesn't solve that.
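That wrapper arrangement can be sketched as follows. This is not actual kernel code; the function names and return values are simplified stand-ins. A hypothetical dma_map_sg_p2p() may return a negative error for P2P segments the device cannot reach, while the legacy dma_map_sg() keeps its "0 or positive nents" contract by treating any negative return as a bug:

```c
#include <assert.h>

enum { MODEL_EREMOTEIO = 121 };  /* stand-in for -EREMOTEIO */

/* returns mapped nents > 0, 0 on a retryable resource failure,
 * or a negative errno for permanently unmappable P2P segments */
int dma_map_sg_p2p_model(int nents, int has_unmappable_p2p)
{
    if (has_unmappable_p2p)
        return -MODEL_EREMOTEIO;
    return nents;
}

/* legacy interface: callers written against the old contract
 * must never observe a negative return */
int dma_map_sg_model(int nents)
{
    int ret = dma_map_sg_p2p_model(nents, 0);

    assert(ret >= 0);   /* would be BUG() in the kernel */
    return ret;
}
```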

Logan

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-12 18:11       ` Robin Murphy
@ 2021-03-12 18:27         ` Logan Gunthorpe
  2021-03-16  7:58           ` Christoph Hellwig
  2021-03-16  7:56         ` Christoph Hellwig
  1 sibling, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 18:27 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Matthew Wilcox,
	Christian König, Jason Gunthorpe, Jason Ekstrand,
	Daniel Vetter, Dan Williams, Stephen Bates, Jakowski Andrzej,
	Christoph Hellwig, Xiong Jianxin



On 2021-03-12 11:11 a.m., Robin Murphy wrote:
> On 2021-03-12 16:24, Logan Gunthorpe wrote:
>>
>>
>> On 2021-03-12 8:52 a.m., Robin Murphy wrote:
>>>> +
>>>>            sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>>>>                    sg->offset, sg->length, dir, attrs);
>>>>            if (sg->dma_address == DMA_MAPPING_ERROR)
>>>> @@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct
>>>> scatterlist *sgl, int nents,
>>>>      out_unmap:
>>>>        dma_direct_unmap_sg(dev, sgl, i, dir, attrs |
>>>> DMA_ATTR_SKIP_CPU_SYNC);
>>>> -    return 0;
>>>> +    return ret;
>>>>    }
>>>>      dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t
>>>> paddr,
>>>> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
>>>> index b6a633679933..adc1a83950be 100644
>>>> --- a/kernel/dma/mapping.c
>>>> +++ b/kernel/dma/mapping.c
>>>> @@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev,
>>>> dma_addr_t addr, size_t size,
>>>>    EXPORT_SYMBOL(dma_unmap_page_attrs);
>>>>      /*
>>>> - * dma_maps_sg_attrs returns 0 on error and > 0 on success.
>>>> - * It should never return a value < 0.
>>>> + * dma_maps_sg_attrs returns 0 on any resource error and > 0 on
>>>> success.
>>>> + *
>>>> + * If 0 is returned, the mapping can be retried and will succeed once
>>>> + * sufficient resources are available.
>>>
>>> That's not a guarantee we can uphold. Retrying forever in the vain hope
>>> that a device might evolve some extra address bits, or a bounce buffer
>>> might magically grow big enough for a gigantic mapping, isn't
>>> necessarily the best idea.
>>
>> Perhaps this is just poorly worded. Returning 0 is the normal case and
>> nothing has changed there. The block layer, for example, will retry if
>> zero is returned as this only happens if it failed to allocate resources
>> for the mapping. The reason we have to return -1 is to tell the block
>> layer not to retry these requests as they will never succeed in the
>> future.
>>
>>>> + *
>>>> + * If there are P2PDMA pages in the scatterlist then this function may
>>>> + * return -EREMOTEIO to indicate that the pages are not mappable by
>>>> the
>>>> + * device. In this case, an error should be returned for the IO as it
>>>> + * will never be successfully retried.
>>>>     */
>>>>    int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int
>>>> nents,
>>>>            enum dma_data_direction dir, unsigned long attrs)
>>>> @@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct
>>>> scatterlist *sg, int nents,
>>>>            ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
>>>>        else
>>>>            ents = ops->map_sg(dev, sg, nents, dir, attrs);
>>>> -    BUG_ON(ents < 0);
>>>> +
>>>
>>> This scares me - I hesitate to imagine the amount of driver/subsystem
>>> code out there that will see nonzero and merrily set off iterating a
>>> negative number of segments, if we open the floodgates of allowing
>>> implementations to return error codes here.
>>
>> Yes, but it will never happen on existing drivers/subsystems. The only
>> way it can return a negative number is if the driver passes in P2PDMA
>> pages which can't happen without changes in the driver. We are careful
>> about where P2PDMA pages can get into so we don't have to worry about
>> all the existing driver code out there.
> 
> Sure, that's how things stand immediately after this patch. But then
> someone comes along with the perfectly reasonable argument for returning
> more expressive error information for regular mapping failures as well
> (because sometimes those can be terminal too, as above), we start to get
> divergent behaviour across architectures and random bits of old code
> subtly breaking down the line. *That* is what makes me wary of making a
> fundamental change to a long-standing "nonzero means success" interface...

So then we reject the patches that make that change. Seems like an odd
argument to say that we can't do something that won't cause problems
because someone might use it as an example and do something that will
cause problems. Reject the change that causes the problem.

Logan


* Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2021-03-12 17:03     ` Logan Gunthorpe
@ 2021-03-12 19:47       ` Robin Murphy
  2021-03-12 20:06         ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Robin Murphy @ 2021-03-12 19:47 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On 2021-03-12 17:03, Logan Gunthorpe wrote:
> 
> 
> On 2021-03-12 8:52 a.m., Robin Murphy wrote:
>> On 2021-03-11 23:31, Logan Gunthorpe wrote:
>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>> apply the appropriate bus address to the segment. The IOVA is not
>>> created if the scatterlist only consists of P2PDMA pages.
>>
>> This misled me at first, but I see the implementation does actually
>> appear to accomodate the case of working ACS where P2P *would* still
>> need to be mapped at the IOMMU.
> 
> Yes, that's correct.
>>>    static int __finalise_sg(struct device *dev, struct scatterlist *sg,
>>> int nents,
>>> -        dma_addr_t dma_addr)
>>> +        dma_addr_t dma_addr, unsigned long attrs)
>>>    {
>>>        struct scatterlist *s, *cur = sg;
>>>        unsigned long seg_mask = dma_get_seg_boundary(dev);
>>> @@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev,
>>> struct scatterlist *sg, int nents,
>>>            sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>            sg_dma_len(s) = 0;
>>>    +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>> +            if (i > 0)
>>> +                cur = sg_next(cur);
>>> +
>>> +            sg_dma_address(cur) = sg_phys(s) + s->offset -
>>
>> Are you sure about that? ;)
> 
> Do you see a bug? I don't follow you...

sg_phys() already accounts for the offset, so you're adding it twice.

>>> +                pci_p2pdma_bus_offset(sg_page(s));
>>
>> Can the bus offset make P2P addresses overlap with regions of mem space
>> that we might use for regular IOVA allocation? That would be very bad...
> 
> No. IOMMU drivers already disallow all PCI addresses from being used as
> IOVA addresses. See, for example,  dmar_init_reserved_ranges(). It would
> be a huge problem for a whole lot of other reasons if it didn't.

I know we reserve the outbound windows (largely *because* some host 
bridges will consider those addresses as attempts at unsupported P2P and 
prevent them working), I just wanted to confirm that this bus offset is 
always something small that stays within the relevant window, rather 
than something that might make a BAR appear in a completely different 
place for P2P purposes. If so, that's good.

>>> @@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>        struct iommu_dma_cookie *cookie = domain->iova_cookie;
>>>        struct iova_domain *iovad = &cookie->iovad;
>>>        struct scatterlist *s, *prev = NULL;
>>> +    struct dev_pagemap *pgmap = NULL;
>>>        int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>>>        dma_addr_t iova;
>>>        size_t iova_len = 0;
>>>        unsigned long mask = dma_get_seg_boundary(dev);
>>> -    int i;
>>> +    int i, map = -1, ret = 0;
>>>          if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
>>>            iommu_deferred_attach(dev, domain))
>>> @@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev,
>>> struct scatterlist *sg,
>>>            s_length = iova_align(iovad, s_length + s_iova_off);
>>>            s->length = s_length;
>>>    +        if (is_pci_p2pdma_page(sg_page(s))) {
>>> +            if (sg_page(s)->pgmap != pgmap) {
>>> +                pgmap = sg_page(s)->pgmap;
>>> +                map = pci_p2pdma_dma_map_type(dev, pgmap);
>>> +            }
>>> +
>>> +            if (map < 0) {
>>
>> It rather feels like it should be the job of whoever creates the list in
>> the first place not to put unusable pages in it, especially since the
>> p2pdma_map_type looks to be a fairly coarse-grained and static thing.
>> The DMA API isn't responsible for validating normal memory pages, so
>> what makes P2P special?
> 
> Yes, that would be ideal, but there's some difficulties there. For the
> driver to check the pages, it would need to loop through the entire SG
> one more time on every transaction, regardless of whether there are
> P2PDMA pages, or not. So that will have a performance impact even when
> the feature isn't being used. I don't think that'll be acceptable for
> many drivers.
> 
> The other possibility is for GUP to do it when it gets the pages from
> userspace. But GUP doesn't have all the information to do this at the
> moment. We'd have to pass the struct device that will eventually map the
> pages through all the nested functions in the GUP to do that test at
> that time. This might not be a bad option (that I half looked into), but
> I'm not sure how acceptable it would be to the GUP developers.

Urgh, yes, if a page may or may not be valid for p2p depending on which 
device is trying to map it, then it probably is most reasonable to 
figure that out at this point. It's a little unfortunate having to cope 
with failure so late, but oh well.

> But even if we do verify the pages ahead of time, we still need the same
> infrastructure in dma_map_sg(); it could only now be a BUG if the driver
> sent invalid pages instead of an error return.

The hope was that we could save doing even that - e.g. if you pass a 
dodgy page into dma_map_page(), maybe page_to_phys() will crash, maybe 
you'll just end up with a DMA address that won't work, but either way it 
doesn't care in its own right - but it seems that's moot.

>>> +                ret = -EREMOTEIO;
>>> +                goto out_restore_sg;
>>> +            }
>>> +
>>> +            if (map) {
>>> +                s->length = 0;
>>
>> I'm not really thrilled about the idea of passing zero-length segments
>> to iommu_map_sg(). Yes, it happens to trick the concatenation logic in
>> the current implementation into doing what you want, but it feels fragile.
> 
> We're not passing zero length segments to iommu_map_sg() (or any
> function). This loop is just scanning to calculate the length of the
> required IOVA. __finalise_sg() (which is intimately tied to this loop)
> then needs a way to determine which segments were P2P segments. The
> existing code already overwrites s->length with an aligned length and
> stores the original length in sg_dma_len. So we're not relying on
> tricking any logic here.

Yes, we temporarily shuffle in page-aligned quantities to satisfy the 
needs of the iommu_map_sg() call, before unpacking things again in 
__finalise_sg(). It's some disgusting trickery that I'm particularly 
proud of. My point is that if you have a mix of both p2p and normal 
segments - which seems to be a case you want to support - then the 
length of 0 that you set to flag p2p segments here will be seen by 
iommu_map_sg() (as it walks the list to map the other segments) before 
you then use it as a key to override the DMA address in the final step. 
It's not a concern if you have a p2p-only list and short-circuit 
straight to that step (in which case all the shuffling was wasted effort 
anyway), but since it's not entirely clear what a segment with zero 
length would mean in general, it seems like a good idea to avoid passing 
the list across a public boundary in that state, if possible.

Robin.

>>>    }
>>>      static void iommu_dma_unmap_sg(struct device *dev, struct
>>> scatterlist *sg,
>>>            int nents, enum dma_data_direction dir, unsigned long attrs)
>>>    {
>>> -    dma_addr_t start, end;
>>> +    dma_addr_t end, start = DMA_MAPPING_ERROR;
>>>        struct scatterlist *tmp;
>>>        int i;
>>>    @@ -1054,14 +1090,20 @@ static void iommu_dma_unmap_sg(struct device
>>> *dev, struct scatterlist *sg,
>>>         * The scatterlist segments are mapped into a single
>>>         * contiguous IOVA allocation, so this is incredibly easy.
>>>         */
>>> -    start = sg_dma_address(sg);
>>> -    for_each_sg(sg_next(sg), tmp, nents - 1, i) {
>>> +    for_each_sg(sg, tmp, nents, i) {
>>> +        if (sg_is_pci_p2pdma(tmp))
>>
>> Since the flag is associated with the DMA address which will no longer
>> be valid, shouldn't it be cleared? The circumstances in which leaving it
>> around could cause a problem are tenuous, but definitely possible.
> 
> Yes, that's a good idea.
> 
> Thanks for the review!
> 
> Logan
> 


* Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2021-03-12 19:47       ` Robin Murphy
@ 2021-03-12 20:06         ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 20:06 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu
  Cc: Minturn Dave B, John Hubbard, Dave Hansen, Ira Weiny,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin



On 2021-03-12 12:47 p.m., Robin Murphy wrote:
>>>>    {
>>>>        struct scatterlist *s, *cur = sg;
>>>>        unsigned long seg_mask = dma_get_seg_boundary(dev);
>>>> @@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev,
>>>> struct scatterlist *sg, int nents,
>>>>            sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>            sg_dma_len(s) = 0;
>>>>    +        if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>> +            if (i > 0)
>>>> +                cur = sg_next(cur);
>>>> +
>>>> +            sg_dma_address(cur) = sg_phys(s) + s->offset -
>>>
>>> Are you sure about that? ;)
>>
>> Do you see a bug? I don't follow you...
> 
> sg_phys() already accounts for the offset, so you're adding it twice.

Ah, oops. Nice catch. I missed that.

> 
>>>> +                pci_p2pdma_bus_offset(sg_page(s));
>>>
>>> Can the bus offset make P2P addresses overlap with regions of mem space
>>> that we might use for regular IOVA allocation? That would be very bad...
>>
>> No. IOMMU drivers already disallow all PCI addresses from being used as
>> IOVA addresses. See, for example,  dmar_init_reserved_ranges(). It would
>> be a huge problem for a whole lot of other reasons if it didn't.
> 
> I know we reserve the outbound windows (largely *because* some host 
> bridges will consider those addresses as attempts at unsupported P2P and 
> prevent them working), I just wanted to confirm that this bus offset is 
> always something small that stays within the relevant window, rather 
> than something that might make a BAR appear in a completely different 
> place for P2P purposes. If so, that's good.

Yes, well, if an IOVA overlaps with any PCI bus address there's going to
be some difficult brokenness, because when the IOVA is used it might be
directed to a PCI device and not the IOMMU. I fixed a bug like that
once.
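That reservation can be sketched as a toy model (the window list and address ranges below are invented for illustration; in the kernel this is done by the IOMMU drivers, e.g. dmar_init_reserved_ranges()):

```c
#include <stdbool.h>

/* PCI outbound windows are carved out of the IOVA space up front, so an
 * allocator can never hand out an IOVA that the host bridge would route
 * to a device instead of the IOMMU. Ranges are hypothetical. */

struct range { unsigned long start, end; };

static const struct range reserved[] = {
    { 0xe0000000UL, 0xefffffffUL },   /* example PCI window */
};

bool iova_is_allowed(unsigned long start, unsigned long end)
{
    for (unsigned i = 0; i < sizeof(reserved) / sizeof(reserved[0]); i++)
        if (start <= reserved[i].end && end >= reserved[i].start)
            return false;   /* overlaps a PCI window: must not be used */
    return true;
}
```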
>>> I'm not really thrilled about the idea of passing zero-length segments
>>> to iommu_map_sg(). Yes, it happens to trick the concatenation logic in
>>> the current implementation into doing what you want, but it feels 
>>> fragile.
>>
>> We're not passing zero length segments to iommu_map_sg() (or any
>> function). This loop is just scanning to calculate the length of the
>> required IOVA. __finalise_sg() (which is intimately tied to this loop)
>> then needs a way to determine which segments were P2P segments. The
>> existing code already overwrites s->length with an aligned length and
>> stores the original length in sg_dma_len. So we're not relying on
>> tricking any logic here.
> 
> Yes, we temporarily shuffle in page-aligned quantities to satisfy the 
> needs of the iommu_map_sg() call, before unpacking things again in 
> __finalise_sg(). It's some disgusting trickery that I'm particularly 
> proud of. My point is that if you have a mix of both p2p and normal 
> segments - which seems to be a case you want to support - then the 
> length of 0 that you set to flag p2p segments here will be seen by 
> iommu_map_sg() (as it walks the list to map the other segments) before 
> you then use it as a key to override the DMA address in the final step. 
> It's not a concern if you have a p2p-only list and short-circuit 
> straight to that step (in which case all the shuffling was wasted effort 
> anyway), but since it's not entirely clear what a segment with zero 
> length would mean in general, it seems like a good idea to avoid passing 
> the list across a public boundary in that state, if possible.

Ok, well, iommu_map_sg() does the right thing as is, without any
changes, and IMO setting sg->length to zero does make sense. Supporting
mixed P2P and normal segments is really the whole point of this series
(the current kernel supports only homogeneous SGLs, via a specialized
path -- see pci_p2pdma_unmap_sg_attrs()). But do you have an alternate
solution for sg->length?
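The zero-length marker scheme under discussion can be modelled in userspace like this. Field names and the bus-offset arithmetic are simplified stand-ins for the scatterlist and pci_p2pdma_bus_offset() details, not kernel code: the scan pass sets length = 0 on P2P segments so they contribute nothing to the IOVA, and the finalise pass keys on that zero to substitute a PCI bus address for an IOVA-relative one:

```c
#include <stddef.h>

struct seg {
    int is_p2p;
    unsigned long phys;     /* CPU physical address of the segment */
    unsigned long length;   /* zeroed to flag P2P during the scan */
    unsigned long dma_addr; /* filled in by finalise() */
};

/* first pass: compute the IOVA length, excluding P2P segments */
unsigned long scan(struct seg *s, size_t n)
{
    unsigned long iova_len = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i].is_p2p)
            s[i].length = 0;
        iova_len += s[i].length;
    }
    return iova_len;
}

/* second pass: assign DMA addresses; zero length marks P2P segments */
void finalise(struct seg *s, size_t n, unsigned long iova,
              unsigned long bus_offset)
{
    unsigned long off = 0;

    for (size_t i = 0; i < n; i++) {
        if (s[i].is_p2p && !s[i].length) {
            s[i].dma_addr = s[i].phys - bus_offset; /* PCI bus address */
        } else {
            s[i].dma_addr = iova + off;
            off += s[i].length;
        }
    }
}
```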

Logan


* Re: [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
  2021-03-11 23:31 ` [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn() Logan Gunthorpe
@ 2021-03-12 20:39   ` Bjorn Helgaas
  2021-03-12 20:53     ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Bjorn Helgaas @ 2021-03-12 20:39 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:31PM -0700, Logan Gunthorpe wrote:
> In order to call this function from a dma_map function, it must not sleep.
> The only reason it does sleep so to allocate the seqbuf to print
> which devices are within the ACS path.

s/this function/upstream_bridge_distance_warn()/ ?
s/so to/is to/

Maybe the subject could say something about the purpose, e.g., allow
calling from atomic context or something?  "Pass gfp_mask flags" sort
of restates what we can read from the patch, but without the
motivation of why this is useful.

> Switch the kmalloc call to use a passed in gfp_mask  and don't print that
> message if the buffer fails to be allocated.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/p2pdma.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 196382630363..bd89437faf06 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
>  
>  static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
>  {
> -	if (!buf)
> +	if (!buf || !buf->buffer)
>  		return;
>  
>  	seq_buf_printf(buf, "%s;", pci_name(pdev));
> @@ -495,25 +495,26 @@ upstream_bridge_distance(struct pci_dev *provider, struct pci_dev *client,
>  
>  static enum pci_p2pdma_map_type
>  upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev *client,
> -			      int *dist)
> +			      int *dist, gfp_t gfp_mask)
>  {
>  	struct seq_buf acs_list;
>  	bool acs_redirects;
>  	int ret;
>  
> -	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
> -	if (!acs_list.buffer)
> -		return -ENOMEM;
> +	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp_mask), PAGE_SIZE);
>  
>  	ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
>  				       &acs_list);
>  	if (acs_redirects) {
>  		pci_warn(client, "ACS redirect is set between the client and provider (%s)\n",
>  			 pci_name(provider));
> -		/* Drop final semicolon */
> -		acs_list.buffer[acs_list.len-1] = 0;
> -		pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
> -			 acs_list.buffer);
> +
> +		if (acs_list.buffer) {
> +			/* Drop final semicolon */
> +			acs_list.buffer[acs_list.len - 1] = 0;
> +			pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
> +				 acs_list.buffer);
> +		}
>  	}
>  
>  	if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
> @@ -566,7 +567,7 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
>  
>  		if (verbose)
>  			ret = upstream_bridge_distance_warn(provider,
> -					pci_client, &distance);
> +					pci_client, &distance, GFP_KERNEL);
>  		else
>  			ret = upstream_bridge_distance(provider, pci_client,
>  						       &distance, NULL, NULL);
> -- 
> 2.20.1
> 


* Re: [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
  2021-03-12 20:39   ` Bjorn Helgaas
@ 2021-03-12 20:53     ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 20:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-12 1:39 p.m., Bjorn Helgaas wrote:
> On Thu, Mar 11, 2021 at 04:31:31PM -0700, Logan Gunthorpe wrote:
>> In order to call this function from a dma_map function, it must not sleep.
>> The only reason it does sleep so to allocate the seqbuf to print
>> which devices are within the ACS path.
> 
> s/this function/upstream_bridge_distance_warn()/ ?
> s/so to/is to/
> 
> Maybe the subject could say something about the purpose, e.g., allow
> calling from atomic context or something?  "Pass gfp_mask flags" sort
> of restates what we can read from the patch, but without the
> motivation of why this is useful.
> 
>> Switch the kmalloc call to use a passed in gfp_mask  and don't print that
>> message if the buffer fails to be allocated.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks! I'll apply these changes for any future postings.

Logan


* Re: [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  2021-03-11 23:31 ` [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps Logan Gunthorpe
@ 2021-03-12 20:57   ` Bjorn Helgaas
  2021-03-12 21:37     ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Bjorn Helgaas @ 2021-03-12 20:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:32PM -0700, Logan Gunthorpe wrote:
> In order to use upstream_bridge_distance_warn() from a dma_map function,
> it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
> might sleep.
> 
> In order to avoid this, try to get the host bridge's device from
> bus->self, and if that is not set just get the first element in the
> list. It should be impossible for the host bridge's device to go away
> while references are held on child devices, so the first element
> should not change and this should be safe.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index bd89437faf06..2135fe69bb07 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -311,11 +311,15 @@ static const struct pci_p2pdma_whitelist_entry {
>  static bool __host_bridge_whitelist(struct pci_host_bridge *host,
>  				    bool same_host_bridge)
>  {
> -	struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
>  	const struct pci_p2pdma_whitelist_entry *entry;
> +	struct pci_dev *root = host->bus->self;
>  	unsigned short vendor, device;
>  
>  	if (!root)
> +		root = list_first_entry_or_null(&host->bus->devices,
> +						struct pci_dev, bus_list);

Replacing one ugliness (assuming there is a pci_dev for the host
bridge, and that it is at 00.0) with another (still assuming a pci_dev
and that it is host->bus->self or the first entry).  I can't suggest
anything better, but maybe a little comment in the code would help
future readers.

I wish we had a real way to discover this property without the
whitelist, at least for future devices.  Was there ever any interest
in a _DSM or similar interface for this?

I *am* very glad to remove a pci_get_slot() usage.

> +
> +	if (!root || root->devfn)
>  		return false;
>  
>  	vendor = root->vendor;

Don't you need to also remove the "pci_dev_put(root)" a few lines
below?


* Re: [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps
  2021-03-12 20:57   ` Bjorn Helgaas
@ 2021-03-12 21:37     ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-12 21:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-12 1:57 p.m., Bjorn Helgaas wrote:
> On Thu, Mar 11, 2021 at 04:31:32PM -0700, Logan Gunthorpe wrote:
>> In order to use upstream_bridge_distance_warn() from a dma_map function,
>> it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
>> might sleep.
>>
>> In order to avoid this, try to get the host bridge's device from
>> bus->self, and if that is not set just get the first element in the
>> list. It should be impossible for the host bridge's device to go away
>> while references are held on child devices, so the first element
>> should not change and this should be safe.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>  drivers/pci/p2pdma.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index bd89437faf06..2135fe69bb07 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -311,11 +311,15 @@ static const struct pci_p2pdma_whitelist_entry {
>>  static bool __host_bridge_whitelist(struct pci_host_bridge *host,
>>  				    bool same_host_bridge)
>>  {
>> -	struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
>>  	const struct pci_p2pdma_whitelist_entry *entry;
>> +	struct pci_dev *root = host->bus->self;
>>  	unsigned short vendor, device;
>>  
>>  	if (!root)
>> +		root = list_first_entry_or_null(&host->bus->devices,
>> +						struct pci_dev, bus_list);
> 
> Replacing one ugliness (assuming there is a pci_dev for the host
> bridge, and that it is at 00.0) with another (still assuming a pci_dev
> and that it is host->bus->self or the first entry).  I can't suggest
> anything better, but maybe a little comment in the code would help
> future readers.

Yeah, I struggled to find a solution here; this was the best I could
come up with. I'd love it if someone had a better idea. I can add a
comment for future iterations.

> I wish we had a real way to discover this property without the
> whitelist, at least for future devices.  Was there ever any interest
> in a _DSM or similar interface for this?

I'd also like to get rid of the whitelist, but I have no idea how, or
who would have to lead the fight, to get the hardware to self-describe
in a way that we could use.

> I *am* very glad to remove a pci_get_slot() usage.
> 
>> +
>> +	if (!root || root->devfn)
>>  		return false;
>>  
>>  	vendor = root->vendor;
> 
> Don't you need to also remove the "pci_dev_put(root)" a few lines
> below?

Yes, right. Good catch!

Logan


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
@ 2021-03-13  1:38   ` Ira Weiny
  2021-03-15 16:27     ` Logan Gunthorpe
  2021-03-13  2:32   ` Ira Weiny
  2021-03-16  8:14   ` Christoph Hellwig
  2 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2021-03-13  1:38 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:34PM -0700, Logan Gunthorpe wrote:
> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
> DMA map functions to determine how to map a given p2pdma page.
> 
> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
> offset if they need to map the bus address.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c       | 50 ++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h | 11 +++++++++
>  2 files changed, 61 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 7f6836732bce..66d16b7eb668 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -912,6 +912,56 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>  
> +/**
> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
> + * @page: page to get the offset for
> + *
> + * Must be passed a PCI p2pdma page.
> + */
> +u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
> +
> +	WARN_ON(!is_pci_p2pdma_page(page));

Shouldn't this check be before the to_p2p_pgmap() call?  And I've been told not
to introduce WARN_ON's.  Should this be?

	if (!is_pci_p2pdma_page(page))
		return -1;

???

Ira


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
  2021-03-13  1:38   ` Ira Weiny
@ 2021-03-13  2:32   ` Ira Weiny
  2021-03-15 16:30     ` Logan Gunthorpe
  2021-03-16  8:14   ` Christoph Hellwig
  2 siblings, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2021-03-13  2:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:34PM -0700, Logan Gunthorpe wrote:
> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
            ^^^^^^^^^^^^^^^^^^^^^^^^^
	    pci_p2pdma_dma_map_type() ???

FWIW I find this name confusing with pci_p2pdma_map_type() and looking at the
implementation I'm not clear why pci_p2pdma_dma_map_type() needs to exist?

Ira

[snip]

> +
> +/**
> + * pci_p2pdma_dma_map_type - determine if a DMA mapping should use the
> + *	bus address, be mapped normally or fail
> + * @dev: device doing the DMA request
> + * @pgmap: dev_pagemap structure for the mapping
> + *
> + * Returns:
> + *    1 - if the page should be mapped with a bus address,
> + *    0 - if the page should be mapped normally through an IOMMU mapping or
> + *        physical address; or
> + *   -1 - if the device should not map the pages and an error should be
> + *        returned
> + */
> +int pci_p2pdma_dma_map_type(struct device *dev, struct dev_pagemap *pgmap)
> +{
> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
> +	struct pci_dev *client;
> +
> +	if (!dev_is_pci(dev))
> +		return -1;
> +
> +	client = to_pci_dev(dev);
> +
> +	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +		return 0;
> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> +		return 1;
> +	default:
> +		return -1;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_dma_map_type);

I guess the main point here is to export this to the DMA layer?

Ira


* Re: [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-03-11 23:31 ` [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
@ 2021-03-13  2:36   ` Ira Weiny
  2021-03-15 16:33     ` Logan Gunthorpe
  2021-03-16  8:15   ` Christoph Hellwig
  1 sibling, 1 reply; 47+ messages in thread
From: Ira Weiny @ 2021-03-13  2:36 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:37PM -0700, Logan Gunthorpe wrote:
 
> +int dma_pci_p2pdma_supported(struct device *dev)
   ^^^
  bool?

> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;

Is this logic correct?  I would have expected.

	return (ops && ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED);

Ira


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-13  1:38   ` Ira Weiny
@ 2021-03-15 16:27     ` Logan Gunthorpe
  2021-03-24 17:21       ` Jason Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-15 16:27 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-12 6:38 p.m., Ira Weiny wrote:
> On Thu, Mar 11, 2021 at 04:31:34PM -0700, Logan Gunthorpe wrote:
>> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
>> DMA map functions to determine how to map a given p2pdma page.
>>
>> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
>> offset if they need to map the bus address.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>  drivers/pci/p2pdma.c       | 50 ++++++++++++++++++++++++++++++++++++++
>>  include/linux/pci-p2pdma.h | 11 +++++++++
>>  2 files changed, 61 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 7f6836732bce..66d16b7eb668 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -912,6 +912,56 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>>  }
>>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>  
>> +/**
>> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
>> + * @page: page to get the offset for
>> + *
>> + * Must be passed a PCI p2pdma page.
>> + */
>> +u64 pci_p2pdma_bus_offset(struct page *page)
>> +{
>> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
>> +
>> +	WARN_ON(!is_pci_p2pdma_page(page));
> 
> Shouldn't this check be before the to_p2p_pgmap() call?  

The to_p2p_pgmap() call is just doing pointer arithmetic, so strictly
speaking it doesn't need to be before. We just can't access p2p_pgmap
until it has been checked.

> And I've been told not
> to introduce WARN_ON's.  Should this be?
> 
> 	if (!is_pci_p2pdma_page(page))
> 		return -1;

In this case the WARN_ON is just to guard against misuse of the
function. It should never happen unless a developer changes the code in
a way that is incorrect. So I think that's the correct use of WARN_ON.
Though I might change it to WARN and return; that seems safer.

Logan


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-13  2:32   ` Ira Weiny
@ 2021-03-15 16:30     ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-15 16:30 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-12 7:32 p.m., Ira Weiny wrote:
> On Thu, Mar 11, 2021 at 04:31:34PM -0700, Logan Gunthorpe wrote:
>> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
>             ^^^^^^^^^^^^^^^^^^^^^^^^^
> 	    pci_p2pdma_dma_map_type() ???
> 
> FWIW I find this name confusing with pci_p2pdma_map_type() and looking at the
> implementation I'm not clear why pci_p2pdma_dma_map_type() needs to exist?

Yeah, there are subtle differences in the prototypes. But yes, they can
probably be combined. Will do for future postings.

>> + * pci_p2pdma_dma_map_type - determine if a DMA mapping should use the
>> + *	bus address, be mapped normally or fail
>> + * @dev: device doing the DMA request
>> + * @pgmap: dev_pagemap structure for the mapping
>> + *
>> + * Returns:
>> + *    1 - if the page should be mapped with a bus address,
>> + *    0 - if the page should be mapped normally through an IOMMU mapping or
>> + *        physical address; or
>> + *   -1 - if the device should not map the pages and an error should be
>> + *        returned
>> + */
>> +int pci_p2pdma_dma_map_type(struct device *dev, struct dev_pagemap *pgmap)
>> +{
>> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
>> +	struct pci_dev *client;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return -1;
>> +
>> +	client = to_pci_dev(dev);
>> +
>> +	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>> +		return 0;
>> +	case PCI_P2PDMA_MAP_BUS_ADDR:
>> +		return 1;
>> +	default:
>> +		return -1;
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(pci_p2pdma_dma_map_type);
> 
> I guess the main point here is to export this to the DMA layer?

Yes, that's correct.

Logan


* Re: [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-03-13  2:36   ` Ira Weiny
@ 2021-03-15 16:33     ` Logan Gunthorpe
  2021-03-16  8:00       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-15 16:33 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-12 7:36 p.m., Ira Weiny wrote:
> On Thu, Mar 11, 2021 at 04:31:37PM -0700, Logan Gunthorpe wrote:
>  
>> +int dma_pci_p2pdma_supported(struct device *dev)
>    ^^^
>   bool?

Sure.

> 
>> +{
>> +	const struct dma_map_ops *ops = get_dma_ops(dev);
>> +
>> +	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
> 
> Is this logic correct?  I would have expected.
> 
> 	return (ops && ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED);


If ops is NULL then the operations in kernel/dma/direct.c are used and
support is added to those in patch 6. So it is correct as written.

Logan


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-12 18:11       ` Robin Murphy
  2021-03-12 18:27         ` Logan Gunthorpe
@ 2021-03-16  7:56         ` Christoph Hellwig
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  7:56 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu, Minturn Dave B, John Hubbard,
	Dave Hansen, Matthew Wilcox, Christian König,
	Jason Gunthorpe, Jason Ekstrand, Daniel Vetter, Dan Williams,
	Stephen Bates, Jakowski Andrzej, Christoph Hellwig,
	Xiong Jianxin

On Fri, Mar 12, 2021 at 06:11:17PM +0000, Robin Murphy wrote:
> Sure, that's how things stand immediately after this patch. But then 
> someone comes along with the perfectly reasonable argument for returning 
> more expressive error information for regular mapping failures as well 
> (because sometimes those can be terminal too, as above), we start to get 
> divergent behaviour across architectures and random bits of old code subtly 
> breaking down the line. *That* is what makes me wary of making a 
> fundamental change to a long-standing "nonzero means success" interface...

Agreed.  IMHO dma_map_sg actually needs to be switched to return
unsigned to help root this out; going the other way is not helpful.


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-12 18:27         ` Logan Gunthorpe
@ 2021-03-16  7:58           ` Christoph Hellwig
  2021-03-16 15:54             ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  7:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu, Minturn Dave B, John Hubbard, Dave Hansen,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Christoph Hellwig, Xiong Jianxin

On Fri, Mar 12, 2021 at 11:27:46AM -0700, Logan Gunthorpe wrote:
> So then we reject the patches that make that change. Seems like an odd
> argument to say that we can't do something that won't cause problems
> because someone might use it as an example and do something that will
> cause problems. Reject the change that causes the problem.

No, the problem is a mess of calling conventions.  A calling convention
returning 0 for error, positive values for success is fine.  One returning
a negative errno for error and positive values for success is fine as well.
One returning 0 for the usual errors and negative errnos for an unusual
corner case is just a complete mess.


* Re: [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-03-15 16:33     ` Logan Gunthorpe
@ 2021-03-16  8:00       ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Ira Weiny, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Mon, Mar 15, 2021 at 10:33:13AM -0600, Logan Gunthorpe wrote:
> >> +	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
> > 
> > Is this logic correct?  I would have expected.
> > 
> > 	return (ops && ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED);
> 
> 
> If ops is NULL then the operations in kernel/dma/direct.c are used and
> support is added to those in patch 6. So it is correct as written.

It is not quite that easy.  There is also the bypass flag, and for the
specific case where that is ignored the code needs a really good
comment, formatted so that it makes sense.  The above line is indeed
highly confusing even if it ends up being correct.


* Re: [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA
  2021-03-11 23:31 ` [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA Logan Gunthorpe
@ 2021-03-16  8:00   ` Christoph Hellwig
  2021-03-16 16:02     ` Logan Gunthorpe
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:00 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:39PM -0700, Logan Gunthorpe wrote:
> Create a specific error code for when P2PDMA pages are passed to a block
> device that cannot map them (due to no IOMMU support or ACS protections).
> 
> This makes request errors in these cases more informative as to what
> caused the error.

I really don't think we should bother with a specific error code here,
we don't add a new status for every single possible logic error in the
caller.


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-11 23:31 ` [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
  2021-03-12 15:52   ` Robin Murphy
@ 2021-03-16  8:11   ` Christoph Hellwig
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:11 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:36PM -0700, Logan Gunthorpe wrote:
>  	for_each_sg(sgl, sg, nents, i) {
> +		if (is_pci_p2pdma_page(sg_page(sg))) {
> +			if (sg_page(sg)->pgmap != pgmap) {
> +				pgmap = sg_page(sg)->pgmap;
> +				map = pci_p2pdma_dma_map_type(dev, pgmap);
> +				bus_off = pci_p2pdma_bus_offset(sg_page(sg));
> +			}
> +
> +			if (map < 0) {
> +				sg->dma_address = DMA_MAPPING_ERROR;
> +				ret = -EREMOTEIO;
> +				goto out_unmap;
> +			}
> +
> +			if (map) {
> +				sg->dma_address = sg_phys(sg) + sg->offset -
> +					bus_off;
> +				sg_dma_len(sg) = sg->length;
> +				sg_mark_pci_p2pdma(sg);
> +				continue;
> +			}
> +		}

This code needs to go into a separate noinline helper to reduce the impact
on the fast path.  Also as Robin noted the offset is already
accounted for in sg_phys.  We also don't ever set the dma_address in
the scatterlist to DMA_MAPPING_ERROR, that is just a return value
for the single entry mapping routines.


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
  2021-03-13  1:38   ` Ira Weiny
  2021-03-13  2:32   ` Ira Weiny
@ 2021-03-16  8:14   ` Christoph Hellwig
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:34PM -0700, Logan Gunthorpe wrote:
> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
> DMA map functions to determine how to map a given p2pdma page.
> 
> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
> offset if they need to map the bus address.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c       | 50 ++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h | 11 +++++++++
>  2 files changed, 61 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 7f6836732bce..66d16b7eb668 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -912,6 +912,56 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>  
> +/**
> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
> + * @page: page to get the offset for
> + *
> + * Must be passed a PCI p2pdma page.
> + */
> +u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
> +
> +	WARN_ON(!is_pci_p2pdma_page(page));
> +
> +	return p2p_pgmap->bus_offset;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);

I don't see why this would be exported.  

> +EXPORT_SYMBOL_GPL(pci_p2pdma_dma_map_type);

Same here.  Also the two helpers don't really look very related.

I suspect you really want to move the p2p handling bits currenly added
to kernel/dma/direct.c into drivers/pci/p2pdma.c instead, as that will
allow direct access to the pci_p2pdma_pagemap and should thus simplify
things quite a bit.


* Re: [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-03-11 23:31 ` [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
  2021-03-13  2:36   ` Ira Weiny
@ 2021-03-16  8:15   ` Christoph Hellwig
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2021-03-16  8:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Thu, Mar 11, 2021 at 04:31:37PM -0700, Logan Gunthorpe wrote:
> +int dma_pci_p2pdma_supported(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
> +}
> +EXPORT_SYMBOL(dma_pci_p2pdma_supported);

EXPORT_SYMBOL_GPL like all new DMA APIs.


* Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2021-03-16  7:58           ` Christoph Hellwig
@ 2021-03-16 15:54             ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-16 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu, Minturn Dave B, John Hubbard, Dave Hansen,
	Matthew Wilcox, Christian König, Jason Gunthorpe,
	Jason Ekstrand, Daniel Vetter, Dan Williams, Stephen Bates,
	Jakowski Andrzej, Xiong Jianxin



On 2021-03-16 1:58 a.m., Christoph Hellwig wrote:
> On Fri, Mar 12, 2021 at 11:27:46AM -0700, Logan Gunthorpe wrote:
>> So then we reject the patches that make that change. Seems like an odd
>> argument to say that we can't do something that won't cause problems
>> because someone might use it as an example and do something that will
>> cause problems. Reject the change that causes the problem.
> 
> No, the problem is a mess of calling conventions.  A calling convention
> returning 0 for error, positive values for success is fine.  One returning
> a negative errno for error and positive values for success is fine as well.
> One returning 0 for the usual errors and negative errnos for an unusual
> corner case is just a complete mess.

Fair enough. I can try implementing a dma_map_sg_p2p() roughly as Robin
suggested that has a more reasonable calling convention.

Most of your other feedback seems easy enough so I'll address it in a
future series.

Thanks,

Logan


* Re: [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA
  2021-03-16  8:00   ` Christoph Hellwig
@ 2021-03-16 16:02     ` Logan Gunthorpe
  0 siblings, 0 replies; 47+ messages in thread
From: Logan Gunthorpe @ 2021-03-16 16:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin



On 2021-03-16 2:00 a.m., Christoph Hellwig wrote:
> On Thu, Mar 11, 2021 at 04:31:39PM -0700, Logan Gunthorpe wrote:
>> Create a specific error code for when P2PDMA pages are passed to a block
>> device that cannot map them (due to no IOMMU support or ACS protections).
>>
>> This makes request errors in these cases more informative as to what
>> caused the error.
> 
> I really don't think we should bother with a specific error code here,
> we don't add a new status for every single possible logic error in the
> caller.

I originally had BLK_STS_IOERR but those errors suggested to people that
the hardware had failed on the request when in fact it was a user error.
I'll try BLK_STS_TARGET unless there's any objection or someone thinks
another error code would make more sense.

Logan


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-15 16:27     ` Logan Gunthorpe
@ 2021-03-24 17:21       ` Jason Gunthorpe
  2021-03-24 18:34         ` Christian König
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Gunthorpe @ 2021-03-24 17:21 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Ira Weiny, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin

On Mon, Mar 15, 2021 at 10:27:08AM -0600, Logan Gunthorpe wrote:

> In this case the WARN_ON is just to guard against misuse of the
> function. It should never happen unless a developer changes the code in
> a way that is incorrect. So I think that's the correct use of WARN_ON.
> Though I might change it to WARN and return, that seems safer.

Right, WARN_ON and return is the right pattern for an assertion that
must never happen:

  if (WARN_ON(foo))
      return -1;

Linus wants assertions like this to be able to recover. People running
the 'panic on warn' mode want the kernel to stop if it detects an
internal malfunction.

Jason


* Re: [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2021-03-24 17:21       ` Jason Gunthorpe
@ 2021-03-24 18:34         ` Christian König
  0 siblings, 0 replies; 47+ messages in thread
From: Christian König @ 2021-03-24 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Ira Weiny, linux-kernel, linux-nvme, linux-block, linux-pci,
	linux-mm, iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Ira Weiny, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin

Am 24.03.21 um 18:21 schrieb Jason Gunthorpe:
> On Mon, Mar 15, 2021 at 10:27:08AM -0600, Logan Gunthorpe wrote:
>
>> In this case the WARN_ON is just to guard against misuse of the
>> function. It should never happen unless a developer changes the code in
>> a way that is incorrect. So I think that's the correct use of WARN_ON.
>> Though I might change it to WARN and return, that seems safer.
> Right, WARN_ON and return is the right pattern for an assertion that
> must never happen:
>
>    if (WARN_ON(foo))
>        return -1;
>
> Linus wants assertions like this to be able to recover. People running
> the 'panic on warn' mode want the kernel to stop if it detects an
> internal malfunction.

The only justification I can see for a "panic on warn" is to prevent 
further data loss or warn early about a crash.

We only use a BUG_ON() when the alternative would be to corrupt something.

Christian.

>
> Jason



end of thread (newest message: 2021-03-24 18:35 UTC)

Thread overview: 47+ messages
2021-03-11 23:31 [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 01/11] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn() Logan Gunthorpe
2021-03-12 20:39   ` Bjorn Helgaas
2021-03-12 20:53     ` Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 02/11] PCI/P2PDMA: Avoid pci_get_slot() which sleeps Logan Gunthorpe
2021-03-12 20:57   ` Bjorn Helgaas
2021-03-12 21:37     ` Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 03/11] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 04/11] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
2021-03-13  1:38   ` Ira Weiny
2021-03-15 16:27     ` Logan Gunthorpe
2021-03-24 17:21       ` Jason Gunthorpe
2021-03-24 18:34         ` Christian König
2021-03-13  2:32   ` Ira Weiny
2021-03-15 16:30     ` Logan Gunthorpe
2021-03-16  8:14   ` Christoph Hellwig
2021-03-11 23:31 ` [RFC PATCH v2 05/11] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
2021-03-12 15:52   ` Robin Murphy
2021-03-12 16:24     ` Logan Gunthorpe
2021-03-12 18:11       ` Robin Murphy
2021-03-12 18:27         ` Logan Gunthorpe
2021-03-16  7:58           ` Christoph Hellwig
2021-03-16 15:54             ` Logan Gunthorpe
2021-03-16  7:56         ` Christoph Hellwig
2021-03-16  8:11   ` Christoph Hellwig
2021-03-11 23:31 ` [RFC PATCH v2 07/11] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
2021-03-13  2:36   ` Ira Weiny
2021-03-15 16:33     ` Logan Gunthorpe
2021-03-16  8:00       ` Christoph Hellwig
2021-03-16  8:15   ` Christoph Hellwig
2021-03-11 23:31 ` [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
2021-03-12 15:52   ` Robin Murphy
2021-03-12 17:03     ` Logan Gunthorpe
2021-03-12 19:47       ` Robin Murphy
2021-03-12 20:06         ` Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 09/11] block: Add BLK_STS_P2PDMA Logan Gunthorpe
2021-03-16  8:00   ` Christoph Hellwig
2021-03-16 16:02     ` Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 10/11] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
2021-03-11 23:31 ` [RFC PATCH v2 11/11] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
2021-03-11 23:59   ` Jason Gunthorpe
2021-03-12  1:37     ` Logan Gunthorpe
2021-03-12 15:51 ` [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA Robin Murphy
2021-03-12 16:18   ` Logan Gunthorpe
2021-03-12 17:46     ` Robin Murphy
2021-03-12 18:24       ` Logan Gunthorpe
