linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices
@ 2020-11-06 17:00 Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn() Logan Gunthorpe
                   ` (14 more replies)
  0 siblings, 15 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

This RFC enables P2PDMA transfers in userspace between NVMe drives using
existing O_DIRECT operations or the NVMe passthrough IOCTL.

This is accomplished by allowing userspace to allocate chunks of any CMB
by mmapping the NVMe ctrl device (Patches 14 and 15). The resulting memory
will be backed by P2P pages and can be passed only to O_DIRECT
operations. A flag is added to GUP in Patch 10, and Patches 11 through 13
wire this flag up based on whether the block queue indicates P2PDMA
support.
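
For illustration, the intended flow from userspace looks roughly like the
sketch below. The device node names, mmap offset and buffer size are only
examples and error handling is omitted; the actual char device semantics
are whatever Patches 14 and 15 provide:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;

        /* CMB provider and the device we do O_DIRECT I/O against */
        int cmb_fd = open("/dev/nvme0", O_RDWR);
        int data_fd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);

        /* allocate a chunk of nvme0's CMB, backed by P2PDMA pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         cmb_fd, 0);

        /* the O_DIRECT read from nvme1 lands directly in nvme0's CMB */
        ssize_t ret = pread(data_fd, buf, len, 0);

        munmap(buf, len);
        close(data_fd);
        close(cmb_fd);
        return ret < 0;
}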

The above is pretty straightforward and (I hope) largely uncontroversial.
However, the one significant problem in all this is that, presently,
pci_p2pdma_map_sg() requires a homogeneous SGL with all P2PDMA pages or
none. Enhancing GUP to support enforcing this rule would require a huge
hack that I don't expect would be all that palatable. So this RFC takes
the approach of removing the requirement of having a homogeneous SGL.

With the new common dma-iommu infrastructure, this patchset adds
support for P2PDMA pages to dma_map_sg(), covering the AMD and (soon)
Intel IOMMU implementations as well as dma-direct. (Other IOMMU
implementations, notably ARM and PowerPC, would remain unsupported.)

The other major blocker is that in order to implement support for
P2PDMA pages in dma_map_sg(), a flag is necessary to determine if a
given dma_addr_t points to P2PDMA memory or to an IOVA so that it can
be unmapped appropriately in dma_unmap_sg(). The (ugly) approach this
RFC takes is to use the top bit in the dma_length field and ensure
callers are prepared for it using a new DMA_ATTR_P2PDMA flag.
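
For an opted-in caller, the pattern ends up looking roughly like the
sketch below (p2pdma_map_example() is made up; the helpers are the ones
introduced in Patches 4 and 5, and nvme-pci is converted to this in
Patch 9):

static int p2pdma_map_example(struct device *dev, struct scatterlist *sgl,
                              int nents)
{
        struct scatterlist *sg;
        int i, nr;

        nr = dma_map_sg_attrs(dev, sgl, nents, DMA_TO_DEVICE,
                              DMA_ATTR_P2PDMA);
        if (!nr)
                return -EIO;

        for_each_sg(sgl, sg, nr, i) {
                /*
                 * The high bit of dma_length flags a P2PDMA bus address,
                 * so the length must be read through the masking helper.
                 */
                dma_addr_t addr = sg_dma_address(sg);
                unsigned int len = sg_dma_p2pdma_len(sg);

                /*
                 * Program addr/len into the device; addr is a PCI bus
                 * address (not an IOVA) when sg_dma_is_p2pdma(sg) is true.
                 */
        }

        dma_unmap_sg_attrs(dev, sgl, nents, DMA_TO_DEVICE,
                           DMA_ATTR_P2PDMA);
        return 0;
}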

I suspect the ultimate solution to this blocker will be to implement
some kind of new dma_op that doesn't use the SGL. Ideas have been
thrown around in the past for one that maps some kind of novel dma_vec
directly from a bio_vec. This will become a lot easier to implement if
more dma_ops providers get converted to the new dma-iommu
implementation, but this will take time.

Alternative ideas or other feedback welcome.

This series is based on v5.10-rc2 with Lu Baolu's (and Tom Murphy's)
v4 patchset for converting the Intel IOMMU to dma-iommu[1]. A git
branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_rfc

Thanks,

Logan

[1] https://lkml.kernel.org/lkml/20200927063437.13988-1-baolu.lu@linux.intel.com/T/#u.


Logan Gunthorpe (15):
  PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn()
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
    pci_p2pdma_bus_offset()
  lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: Convert to using dma_map_sg for p2pdma pages
  mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: Introduce iov_iter_get_pages_[alloc_]flags()
  block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: Set FOLL_PCI_P2PDMA in bio_map_user_iov()
  PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  nvme-pci: Allow mmaping the CMB in userspace

 block/bio.c                 |   7 +-
 block/blk-map.c             |   7 +-
 drivers/dax/super.c         |   7 +-
 drivers/iommu/dma-iommu.c   |  63 +++++++++++--
 drivers/nvme/host/core.c    |  14 ++-
 drivers/nvme/host/nvme.h    |   3 +-
 drivers/nvme/host/pci.c     |  50 ++++++----
 drivers/pci/p2pdma.c        | 178 +++++++++++++++++++++++++++++++++---
 include/linux/dma-map-ops.h |   3 +
 include/linux/dma-mapping.h |  16 ++++
 include/linux/memremap.h    |   4 +-
 include/linux/mm.h          |   1 +
 include/linux/pci-p2pdma.h  |  17 ++++
 include/linux/scatterlist.h |   4 +
 include/linux/uio.h         |  21 ++++-
 kernel/dma/direct.c         |  33 ++++++-
 kernel/dma/mapping.c        |   8 ++
 lib/iov_iter.c              |  25 ++---
 mm/gup.c                    |  28 +++---
 mm/huge_memory.c            |   8 +-
 mm/memory-failure.c         |   4 +-
 mm/memremap.c               |  14 ++-
 22 files changed, 427 insertions(+), 88 deletions(-)


base-commit: 5ba8a2512e8c5f5cf9b7309dc895612f0a77a399
--
2.20.1


* [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-09  9:12   ` Christoph Hellwig
  2020-11-06 17:00 ` [RFC PATCH 02/15] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

In order to call this function from a dma_map function, it must not sleep.
The only reason it sleeps is to allocate the seqbuf used to print
which devices are within the ACS path.

Switch the kmalloc call to GFP_NOWAIT and simply skip printing that
message if the buffer fails to be allocated.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index de1c331dbed4..94583779c36e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
 
 static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
 {
-	if (!buf)
+	if (!buf || !buf->buffer)
 		return;
 
 	seq_buf_printf(buf, "%s;", pci_name(pdev));
@@ -501,19 +501,20 @@ upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev *client,
 	bool acs_redirects;
 	int ret;
 
-	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
-	if (!acs_list.buffer)
-		return -ENOMEM;
+	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_NOWAIT), PAGE_SIZE);
 
 	ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
 				       &acs_list);
 	if (acs_redirects) {
 		pci_warn(client, "ACS redirect is set between the client and provider (%s)\n",
 			 pci_name(provider));
-		/* Drop final semicolon */
-		acs_list.buffer[acs_list.len-1] = 0;
-		pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
-			 acs_list.buffer);
+
+		if (acs_list.buffer) {
+			/* Drop final semicolon */
+			acs_list.buffer[acs_list.len - 1] = 0;
+			pci_warn(client, "to disable ACS redirect for this path, add the kernel parameter: pci=disable_acs_redir=%s\n",
+				 acs_list.buffer);
+		}
 	}
 
 	if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
-- 
2.20.1



* [RFC PATCH 02/15] PCI/P2PDMA: Attempt to set map_type if it has not been set
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages come from userspace then there is no way to
ensure the path was checked beforehand.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 94583779c36e..ea8472278b11 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -824,11 +824,17 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
 						    struct pci_dev *client)
 {
+	enum pci_p2pdma_map_type ret;
+
 	if (!provider->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
-	return xa_to_value(xa_load(&provider->p2pdma->map_types,
-				   map_types_idx(client)));
+	ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+				  map_types_idx(client)));
+	if (ret != PCI_P2PDMA_MAP_UNKNOWN)
+		return ret;
+
+	return upstream_bridge_distance_warn(provider, client, NULL);
 }
 
 static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -890,7 +896,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
 	default:
-		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
-- 
2.20.1



* [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn() Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 02/15] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-10 23:25   ` Bjorn Helgaas
  2020-11-06 17:00 ` [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Introduce pci_p2pdma_should_map_bus() which is meant to be called by
dma map functions to determine how to map a given p2pdma page.

pci_p2pdma_bus_offset() is also added to allow callers to get the bus
offset if they need to map the bus address.
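
For reference, a dma_map_sg() implementation is expected to use these
roughly as in the sketch below (the real users are the dma-direct and
dma-iommu patches later in the series):

        int map = pci_p2pdma_should_map_bus(dev, sg_page(sg)->pgmap);

        if (map < 0) {
                /* this device can't do P2PDMA with the provider; fail */
        } else if (map) {
                /* bypass the IOMMU and use the PCI bus address directly */
                sg->dma_address = sg_phys(sg) -
                        pci_p2pdma_bus_offset(sg_page(sg));
        } else {
                /* map through the host bridge like any other page */
        }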

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 46 ++++++++++++++++++++++++++++++++++++++
 include/linux/pci-p2pdma.h | 11 +++++++++
 2 files changed, 57 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ea8472278b11..9961e779f430 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_bus_offset - returns the bus offset for a given page
+ * @page: page to get the offset for
+ *
+ * Must be passed a pci p2pdma page.
+ */
+u64 pci_p2pdma_bus_offset(struct page *page)
+{
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
+
+	WARN_ON(!is_pci_p2pdma_page(page));
+
+	return p2p_pgmap->bus_offset;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
+
+/**
+ * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
+ *	bus address
+ * @dev: device doing the DMA request
+ * @pgmap: dev_pagemap structure for the mapping
+ *
+ * Returns 1 if the page should be mapped with a bus address, 0 otherwise,
+ * and -1 if the device should not map P2PDMA pages.
+ */
+int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap)
+{
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
+	struct pci_dev *client;
+
+	if (!dev_is_pci(dev))
+		return -1;
+
+	client = to_pci_dev(dev);
+
+	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		return 0;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		return 1;
+	default:
+		return -1;
+	}
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_should_map_bus);
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  *		to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..fc5de47eeac4 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -34,6 +34,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
+u64 pci_p2pdma_bus_offset(struct page *page);
+int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,6 +85,15 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline u64 pci_p2pdma_bus_offset(struct page *page)
+{
+	return -1;
+}
+static inline int pci_p2pdma_should_map_bus(struct device *dev,
+					    struct dev_pagemap *pgmap)
+{
+	return -1;
+}
 static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
 		struct scatterlist *sg, int nents, enum dma_data_direction dir,
 		unsigned long attrs)
-- 
2.20.1



* [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-09  9:12   ` Christoph Hellwig
  2020-11-06 17:00 ` [RFC PATCH 05/15] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

We make use of the top bit of the dma_length to indicate a P2PDMA
segment. Code that uses this will need to use sg_dma_p2pdma_len() to
get the length and ensure no lengths exceed 2GB.

An sg_dma_is_p2pdma() helper is included to check whether a particular
segment is a P2PDMA segment.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/scatterlist.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 36c47e7e66a2..e738159d56f9 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -39,6 +39,10 @@ struct scatterlist {
 #define sg_dma_len(sg)		((sg)->length)
 #endif
 
+#define SG_P2PDMA_FLAG	(1U << 31)
+#define sg_dma_p2pdma_len(sg)	(sg_dma_len(sg) & ~SG_P2PDMA_FLAG)
+#define sg_dma_is_p2pdma(sg)	(sg_dma_len(sg) & SG_P2PDMA_FLAG)
+
 struct sg_table {
 	struct scatterlist *sgl;	/* the list */
 	unsigned int nents;		/* number of mapped entries */
-- 
2.20.1



* [RFC PATCH 05/15] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 06/15] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

The DMA_ATTR_P2PDMA flag is added to allow callers to indicate support
for P2PDMA pages. In order to support P2PDMA pages, a caller must
ensure that no segment is greater than 2GB so that the high bit of
the dma length can be used as a flag to indicate a P2PDMA segment.
Such code must then use sg_dma_p2pdma_len() instead of sg_dma_len()
to mask out the flag.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/dma-mapping.h | 11 +++++++++++
 kernel/dma/direct.c         | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 956151052d45..8d028e15b531 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,17 @@
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * DMA_ATTR_P2PDMA: specifies that dma_map_sg() may return p2pdma
+ * bus addresses. Code that sets this attribute must use
+ * sg_dma_p2pdma_len() instead of sg_dma_len(), as the high
+ * bit of the length indicates a P2PDMA bus address.
+ *
+ * If this attribute is not set and P2PDMA pages are encountered,
+ * dma_map_sg() will return an error.
+ */
+#define DMA_ATTR_P2PDMA			(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform.  It can
  * be given to a device to use as a DMA source or target.  It is specific to a
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 06c111544f61..2fcb31789436 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include "direct.h"
 
 /*
@@ -387,19 +388,47 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	for_each_sg(sgl, sg, nents, i)
+	for_each_sg(sgl, sg, nents, i) {
+		if (attrs & DMA_ATTR_P2PDMA && sg_dma_is_p2pdma(sg)) {
+			sg_dma_len(sg) &= ~SG_P2PDMA_FLAG;
+			continue;
+		}
+
 		dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
 			     attrs);
+	}
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	int i;
+	struct dev_pagemap *pgmap = NULL;
+	int i, map = -1;
 	struct scatterlist *sg;
+	u64 bus_off;
 
 	for_each_sg(sgl, sg, nents, i) {
+		if (is_pci_p2pdma_page(sg_page(sg))) {
+			if (sg_page(sg)->pgmap != pgmap) {
+				pgmap = sg_page(sg)->pgmap;
+				map = pci_p2pdma_should_map_bus(dev, pgmap);
+				bus_off = pci_p2pdma_bus_offset(sg_page(sg));
+			}
+
+			if (map < 0 || !(attrs & DMA_ATTR_P2PDMA)) {
+				sg->dma_address = DMA_MAPPING_ERROR;
+				goto out_unmap;
+			}
+
+			if (map) {
+				sg->dma_address = sg_phys(sg) + sg->offset -
+					bus_off;
+				sg_dma_len(sg) = sg->length | SG_P2PDMA_FLAG;
+				continue;
+			}
+		}
+
 		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
 				sg->offset, sg->length, dir, attrs);
 		if (sg->dma_address == DMA_MAPPING_ERROR)
-- 
2.20.1



* [RFC PATCH 06/15] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 05/15] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 07/15] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
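
Drivers are then expected to gate P2PDMA on this helper rather than
assume support, roughly as below (this is what nvme-pci ends up doing
via its new ctrl op in Patch 8):

        if (dma_pci_p2pdma_supported(dev))
                blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);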

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/dma-map-ops.h | 3 +++
 include/linux/dma-mapping.h | 5 +++++
 kernel/dma/mapping.c        | 8 ++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index a5f89fc4d6df..8c12dfa43ce1 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -12,6 +12,9 @@
 struct cma;
 
 struct dma_map_ops {
+	unsigned int flags;
+#define DMA_F_PCI_P2PDMA_SUPPORTED     (1 << 0)
+
 	void *(*alloc)(struct device *dev, size_t size,
 			dma_addr_t *dma_handle, gfp_t gfp,
 			unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 8d028e15b531..3646c6ba6a00 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -149,6 +149,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
 		unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+int dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -244,6 +245,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
 {
 	return 0;
 }
+static inline int dma_pci_p2pdma_supported(struct device *dev)
+{
+	return 0;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
 	return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..192418ef43f8 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -567,6 +567,14 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+int dma_pci_p2pdma_supported(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.20.1



* [RFC PATCH 07/15] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 06/15] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 08/15] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

Similar to dma-direct, DMA_ATTR_P2PDMA is used to indicate caller
support, given that the high bit of the dma_length is used as a flag
to indicate P2PDMA segments.

On unmap, P2PDMA segments are skipped over when determining the
start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is
set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
P2PDMA pages.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/iommu/dma-iommu.c | 63 ++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 5591d6593583..1c8402474376 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/swiotlb.h>
 #include <linux/scatterlist.h>
 #include <linux/vmalloc.h>
@@ -872,13 +873,16 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
  * segment's start address to avoid concatenating across one.
  */
 static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
-		dma_addr_t dma_addr)
+		dma_addr_t dma_addr, unsigned long attrs)
 {
 	struct scatterlist *s, *cur = sg;
 	unsigned long seg_mask = dma_get_seg_boundary(dev);
 	unsigned int cur_len = 0, max_len = dma_get_max_seg_size(dev);
 	int i, count = 0;
 
+	if (attrs & DMA_ATTR_P2PDMA && max_len >= SG_P2PDMA_FLAG)
+		max_len = SG_P2PDMA_FLAG - 1;
+
 	/*
 	 * The Intel graphic driver is used to assume that the returned
 	 * sg list is not combound. This blocks the efforts of converting
@@ -917,6 +921,19 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		sg_dma_address(s) = DMA_MAPPING_ERROR;
 		sg_dma_len(s) = 0;
 
+		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+			if (i > 0)
+				cur = sg_next(cur);
+
+			sg_dma_address(cur) = sg_phys(s) + s->offset -
+				pci_p2pdma_bus_offset(sg_page(s));
+			sg_dma_len(cur) = s->length | SG_P2PDMA_FLAG;
+
+			count++;
+			cur_len = 0;
+			continue;
+		}
+
 		/*
 		 * Now fill in the real DMA data. If...
 		 * - there is a valid output segment to append to
@@ -1013,11 +1030,12 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
 	struct scatterlist *s, *prev = NULL;
+	struct dev_pagemap *pgmap = NULL;
 	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
-	int i;
+	int i, map = -1;
 
 	if (unlikely(iommu_dma_deferred_attach(dev, domain)))
 		return 0;
@@ -1045,6 +1063,21 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		s_length = iova_align(iovad, s_length + s_iova_off);
 		s->length = s_length;
 
+		if (is_pci_p2pdma_page(sg_page(s))) {
+			if (sg_page(s)->pgmap != pgmap) {
+				pgmap = sg_page(s)->pgmap;
+				map = pci_p2pdma_should_map_bus(dev, pgmap);
+			}
+
+			if (map < 0 || !(attrs & DMA_ATTR_P2PDMA))
+				goto out_restore_sg;
+
+			if (map) {
+				s->length = 0;
+				continue;
+			}
+		}
+
 		/*
 		 * Due to the alignment of our single IOVA allocation, we can
 		 * depend on these assumptions about the segment boundary mask:
@@ -1067,6 +1100,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		prev = s;
 	}
 
+	if (!iova_len)
+		return __finalise_sg(dev, sg, nents, 0, attrs);
+
 	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova)
 		goto out_restore_sg;
@@ -1078,7 +1114,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len)
 		goto out_free_iova;
 
-	return __finalise_sg(dev, sg, nents, iova);
+	return __finalise_sg(dev, sg, nents, iova, attrs);
 
 out_free_iova:
 	iommu_dma_free_iova(cookie, iova, iova_len, NULL);
@@ -1090,7 +1126,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-	dma_addr_t start, end;
+	dma_addr_t end, start = DMA_MAPPING_ERROR;
 	struct scatterlist *tmp;
 	int i;
 
@@ -1106,14 +1142,20 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	 * The scatterlist segments are mapped into a single
 	 * contiguous IOVA allocation, so this is incredibly easy.
 	 */
-	start = sg_dma_address(sg);
-	for_each_sg(sg_next(sg), tmp, nents - 1, i) {
-		if (sg_dma_len(tmp) == 0)
+	for_each_sg(sg, tmp, nents, i) {
+		if ((attrs & DMA_ATTR_P2PDMA) && sg_dma_is_p2pdma(tmp))
+			continue;
+		if (sg_dma_p2pdma_len(tmp) == 0)
 			break;
-		sg = tmp;
+
+		if (start == DMA_MAPPING_ERROR)
+			start = sg_dma_address(tmp);
+
+		end = sg_dma_address(tmp) + sg_dma_len(tmp);
 	}
-	end = sg_dma_address(sg) + sg_dma_len(sg);
-	__iommu_dma_unmap(dev, start, end - start);
+
+	if (start != DMA_MAPPING_ERROR)
+		__iommu_dma_unmap(dev, start, end - start);
 }
 
 static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
@@ -1334,6 +1376,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 }
 
 static const struct dma_map_ops iommu_dma_ops = {
+	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED,
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
 	.alloc_pages		= dma_common_alloc_pages,
-- 
2.20.1



* [RFC PATCH 08/15] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 07/15] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 09/15] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +++++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 376096bfc54a..f14316c9b34a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3824,7 +3824,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+	if (ctrl->ops->supports_pci_p2pdma &&
+	    ctrl->ops->supports_pci_p2pdma(ctrl))
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
 	ns->queue->queuedata = ns;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cc111136a981..a0bfe8709cfc 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -465,7 +465,6 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
-#define NVME_F_PCI_P2PDMA		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -473,6 +472,7 @@ struct nvme_ctrl_ops {
 	void (*submit_async_event)(struct nvme_ctrl *ctrl);
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index df8f3612107f..ef7ce464a48d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2686,17 +2686,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+	return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+	.flags			= NVME_F_METADATA_SUPPORTED,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
 	.free_ctrl		= nvme_pci_free_ctrl,
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
+	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.20.1



* [RFC PATCH 09/15] nvme-pci: Convert to using dma_map_sg for p2pdma pages
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 08/15] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 10/15] mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Switch to using sg_dma_p2pdma_len() in places where sg_dma_len() is
used. Then replace the calls to pci_p2pdma_[un]map_sg() with calls
to dma_[un]map_sg() with DMA_ATTR_P2PDMA.

This should be equivalent, though the resulting support is somewhat
narrower (only dma-direct and dma-iommu are currently supported).

Using DMA_ATTR_P2PDMA is safe because the block layer restricts
requests to be much less than 2GB, thus there's no way for a
segment to be greater than 2GB.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/pci.c | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ef7ce464a48d..26976bdf4af0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -528,12 +528,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	WARN_ON_ONCE(!iod->nents);
 
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-				    rq_dma_dir(req));
-	else
-		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-
+	dma_unmap_sg_attrs(dev->dev, iod->sg, iod->nents, rq_dma_dir(req),
+			   DMA_ATTR_P2PDMA);
 
 	if (iod->npages == 0)
 		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
@@ -570,7 +566,7 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 		pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
 			"dma_address:%pad dma_length:%d\n",
 			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
-			sg_dma_len(sg));
+			sg_dma_p2pdma_len(sg));
 	}
 }
 
@@ -581,7 +577,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
 	struct scatterlist *sg = iod->sg;
-	int dma_len = sg_dma_len(sg);
+	int dma_len = sg_dma_p2pdma_len(sg);
 	u64 dma_addr = sg_dma_address(sg);
 	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
 	__le64 *prp_list;
@@ -601,7 +597,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	} else {
 		sg = sg_next(sg);
 		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
+		dma_len = sg_dma_p2pdma_len(sg);
 	}
 
 	if (length <= NVME_CTRL_PAGE_SIZE) {
@@ -650,7 +646,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 			goto bad_sgl;
 		sg = sg_next(sg);
 		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
+		dma_len = sg_dma_p2pdma_len(sg);
 	}
 
 done:
@@ -670,7 +666,7 @@ static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
 		struct scatterlist *sg)
 {
 	sge->addr = cpu_to_le64(sg_dma_address(sg));
-	sge->length = cpu_to_le32(sg_dma_len(sg));
+	sge->length = cpu_to_le32(sg_dma_p2pdma_len(sg));
 	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
@@ -814,14 +810,12 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (!iod->nents)
 		goto out;
 
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
-				iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	else
-		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
-					     rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	if (!nr_mapped)
+	nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+			rq_dma_dir(req), DMA_ATTR_NO_WARN | DMA_ATTR_P2PDMA);
+	if (!nr_mapped) {
+		ret = BLK_STS_IOERR;
 		goto out;
+	}
 
 	iod->use_sgl = nvme_pci_use_sgls(dev, req);
 	if (iod->use_sgl)
-- 
2.20.1



* [RFC PATCH 10/15] mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 09/15] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 11/15] iov_iter: Introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If a caller does not set this flag,
any attempt to get P2PDMA pages will fail.

This is implemented by adding a flag and a check to get_dev_pagemap().
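
A caller that can handle P2PDMA pages then simply passes the new flag
through to GUP, for example:

        /* sketch: only callers prepared for P2PDMA pages set the flag */
        nr = get_user_pages_fast(addr, nr_pages,
                                 FOLL_WRITE | FOLL_PCI_P2PDMA, pages);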

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/dax/super.c      |  7 ++++---
 include/linux/memremap.h |  4 ++--
 include/linux/mm.h       |  1 +
 mm/gup.c                 | 28 +++++++++++++++++-----------
 mm/huge_memory.c         |  8 ++++----
 mm/memory-failure.c      |  4 ++--
 mm/memremap.c            | 14 ++++++++++----
 7 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index edc279be3e59..59c05839b3f8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -132,9 +132,10 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
 	} else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
 		struct dev_pagemap *pgmap, *end_pgmap;
 
-		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
-		end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
-		if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
+		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL, false);
+		end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL, false);
+		if (!IS_ERR_OR_NULL(pgmap) && pgmap == end_pgmap
+				&& pgmap->type == MEMORY_DEVICE_FS_DAX
 				&& pfn_t_to_page(pfn)->pgmap == pgmap
 				&& pfn_t_to_page(end_pfn)->pgmap == pgmap
 				&& pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..14f6d899c657 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -136,7 +136,7 @@ void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
@@ -160,7 +160,7 @@ static inline void devm_memunmap_pages(struct device *dev,
 }
 
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
 {
 	return NULL;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..c405af966a4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2790,6 +2790,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 102877ed77a4..abbc0a06d241 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -449,11 +449,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 * case since they are only valid while holding the pgmap
 		 * reference.
 		 */
-		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-		if (*pgmap)
+		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap,
+					 flags & FOLL_PCI_P2PDMA);
+		if (IS_ERR(*pgmap)) {
+			page = ERR_CAST(*pgmap);
+			goto out;
+		} else if (*pgmap) {
 			page = pte_page(pte);
-		else
+		} else {
 			goto no_page;
+		}
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
@@ -794,7 +799,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct page *page;
 
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
-	if (ctx.pgmap)
+	if (!IS_ERR_OR_NULL(ctx.pgmap))
 		put_dev_pagemap(ctx.pgmap);
 	return page;
 }
@@ -1138,7 +1143,7 @@ static long __get_user_pages(struct mm_struct *mm,
 		nr_pages -= page_increm;
 	} while (nr_pages);
 out:
-	if (ctx.pgmap)
+	if (!IS_ERR_OR_NULL(ctx.pgmap))
 		put_dev_pagemap(ctx.pgmap);
 	return i ? i : ret;
 }
@@ -2177,8 +2182,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			if (unlikely(flags & FOLL_LONGTERM))
 				goto pte_unmap;
 
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
+			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap,
+						flags & FOLL_PCI_P2PDMA);
+			if (IS_ERR_OR_NULL(pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
@@ -2221,7 +2227,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;
 
 pte_unmap:
-	if (pgmap)
+	if (!IS_ERR_OR_NULL(pgmap))
 		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
 	return ret;
@@ -2255,8 +2261,8 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 	do {
 		struct page *page = pfn_to_page(pfn);
 
-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
+		pgmap = get_dev_pagemap(pfn, pgmap, flags & FOLL_PCI_P2PDMA);
+		if (IS_ERR_OR_NULL(pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
@@ -2681,7 +2687,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-				       FOLL_FAST_ONLY)))
+				       FOLL_FAST_ONLY | FOLL_PCI_P2PDMA)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..3030548dc007 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -985,8 +985,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
+	*pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+	if (IS_ERR_OR_NULL(*pgmap))
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
@@ -1159,8 +1159,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
+	*pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+	if (IS_ERR_OR_NULL(*pgmap))
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c0bb186bba62..44fdad77f06a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1328,8 +1328,8 @@ int memory_failure(unsigned long pfn, int flags)
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		if (pfn_valid(pfn)) {
-			pgmap = get_dev_pagemap(pfn, NULL);
-			if (pgmap)
+			pgmap = get_dev_pagemap(pfn, NULL, false);
+			if (!IS_ERR_OR_NULL(pgmap))
 				return memory_failure_dev_pagemap(pfn, flags,
 								  pgmap);
 		}
diff --git a/mm/memremap.c b/mm/memremap.c
index 73a206d0f645..5f0284c3a3c3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -197,14 +197,14 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 				"altmap not supported for multiple ranges\n"))
 		return -EINVAL;
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL, true);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
 		return -ENOMEM;
 	}
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL, true);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -454,19 +454,20 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
  * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
  * @pgmap: optional known pgmap that already has a reference
+ * @allow_pci_p2pdma: allow getting a pgmap with the PCI P2PDMA type
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
  */
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
 {
 	resource_size_t phys = PFN_PHYS(pfn);
 
 	/*
 	 * In the cached case we're already holding a live reference.
 	 */
-	if (pgmap) {
+	if (!IS_ERR_OR_NULL(pgmap)) {
 		if (phys >= pgmap->range.start && phys <= pgmap->range.end)
 			return pgmap;
 		put_dev_pagemap(pgmap);
@@ -479,6 +480,11 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		pgmap = NULL;
 	rcu_read_unlock();
 
+	if (!allow_pci_p2pdma && pgmap->type == MEMORY_DEVICE_PCI_P2PDMA) {
+		put_dev_pagemap(pgmap);
+		return ERR_PTR(-EINVAL);
+	}
+
 	return pgmap;
 }
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
-- 
2.20.1



* [RFC PATCH 11/15] iov_iter: Introduce iov_iter_get_pages_[alloc_]flags()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 10/15] mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 12/15] block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
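
For example, the block layer patches later in the series end up doing
roughly:

        if (blk_queue_pci_p2pdma(bio->bi_disk->queue))
                gup_flags |= FOLL_PCI_P2PDMA;

        size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
                                        &offset, gup_flags);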

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/uio.h | 21 +++++++++++++++++----
 lib/iov_iter.c      | 25 ++++++++++++++-----------
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..694caa868c05 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -221,10 +221,23 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_
 void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
 			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
-			size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
-			size_t maxsize, size_t *start);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *start,
+		unsigned int gup_flags);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start,
+		unsigned int gup_flags);
+static inline ssize_t iov_iter_get_pages(struct iov_iter *i,
+		struct page **pages, size_t maxsize, unsigned maxpages,
+		size_t *start)
+{
+	return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
+static inline ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start)
+{
+	return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..700effe0bb86 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1313,9 +1313,9 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
+		   size_t *start, unsigned int gup_flags)
 {
 	if (maxsize > i->count)
 		maxsize = i->count;
@@ -1325,6 +1325,9 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
+	if (iov_iter_rw(i) != WRITE)
+		gup_flags |= FOLL_WRITE;
+
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1335,9 +1338,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 			len = maxpages * PAGE_SIZE;
 		addr &= ~(PAGE_SIZE - 1);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res < 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1352,7 +1353,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	)
 	return 0;
 }
-EXPORT_SYMBOL(iov_iter_get_pages);
+EXPORT_SYMBOL(iov_iter_get_pages_flags);
 
 static struct page **get_pages_array(size_t n)
 {
@@ -1392,9 +1393,9 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	return n;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+		   size_t *start, unsigned int gup_flags)
 {
 	struct page **p;
 
@@ -1406,6 +1407,9 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (unlikely(iov_iter_is_discard(i)))
 		return -EFAULT;
 
+	if (iov_iter_rw(i) != WRITE)
+		gup_flags |= FOLL_WRITE;
+
 	iterate_all_kinds(i, maxsize, v, ({
 		unsigned long addr = (unsigned long)v.iov_base;
 		size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1417,8 +1421,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res < 0)) {
 			kvfree(p);
 			return res;
@@ -1439,7 +1442,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	)
 	return 0;
 }
-EXPORT_SYMBOL(iov_iter_get_pages_alloc);
+EXPORT_SYMBOL(iov_iter_get_pages_alloc_flags);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
-- 
2.20.1



* [RFC PATCH 12/15] block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (10 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 11/15] iov_iter: Introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 13/15] block: Set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap-based filesystems
as well as direct I/O to block devices.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index fa01bef35bb1..6b28256a5575 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -997,6 +997,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
 	bool same_page = false;
+	unsigned int flags = 0;
 	ssize_t size, left;
 	unsigned len, i;
 	size_t offset;
@@ -1009,7 +1010,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (bio->bi_disk && blk_queue_pci_p2pdma(bio->bi_disk->queue))
+		flags |= FOLL_PCI_P2PDMA;
+
+	size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+					&offset, flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 13/15] block: Set FOLL_PCI_P2PDMA in bio_map_user_iov()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 12/15] block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem() Logan Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace Logan Gunthorpe
  14 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_alloc_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to use
P2PDMA pages.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-map.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 21630dccac62..cad8fcfe4f17 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -245,6 +245,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	struct bio *bio, *bounce_bio;
+	unsigned int flags = 0;
 	int ret;
 	int j;
 
@@ -256,13 +257,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 	bio->bi_opf |= req_op(rq);
 
+	if (blk_queue_pci_p2pdma(rq->q))
+		flags |= FOLL_PCI_P2PDMA;
+
 	while (iov_iter_count(iter)) {
 		struct page **pages;
 		ssize_t bytes;
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+						       &offs, flags);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (12 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 13/15] block: Set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:22   ` Jason Gunthorpe
  2020-11-06 17:00 ` [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace Logan Gunthorpe
  14 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
 include/linux/pci-p2pdma.h |   6 +++
 2 files changed, 110 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 9961e779f430..8eab53ac59ae 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -16,6 +16,7 @@
 #include <linux/genalloc.h>
 #include <linux/memremap.h>
 #include <linux/percpu-refcount.h>
+#include <linux/pfn_t.h>
 #include <linux/random.h>
 #include <linux/seq_buf.h>
 #include <linux/xarray.h>
@@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 	return sprintf(page, "%s\n", pci_name(p2p_dev));
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+struct pci_p2pdma_map {
+	struct kref ref;
+	struct pci_dev *pdev;
+	void *kaddr;
+	size_t len;
+};
+
+static struct pci_p2pdma_map *pci_p2pdma_map_alloc(struct pci_dev *pdev,
+						   size_t len)
+{
+	struct pci_p2pdma_map *pmap;
+
+	pmap = kzalloc(sizeof(*pmap), GFP_KERNEL);
+	if (!pmap)
+		return NULL;
+
+	kref_init(&pmap->ref);
+	pmap->pdev = pdev;
+	pmap->len = len;
+
+	pmap->kaddr = pci_alloc_p2pmem(pdev, len);
+	if (!pmap->kaddr) {
+		kfree(pmap);
+		return NULL;
+	}
+
+	return pmap;
+}
+
+static void pci_p2pdma_map_free(struct kref *ref)
+{
+	struct pci_p2pdma_map *pmap =
+		container_of(ref, struct pci_p2pdma_map, ref);
+
+	pci_free_p2pmem(pmap->pdev, pmap->kaddr, pmap->len);
+	kfree(pmap);
+}
+
+static void pci_p2pdma_vma_open(struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+	kref_get(&pmap->ref);
+}
+
+static void pci_p2pdma_vma_close(struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+	kref_put(&pmap->ref, pci_p2pdma_map_free);
+}
+
+const struct vm_operations_struct pci_p2pdma_vmops = {
+	.open = pci_p2pdma_vma_open,
+	.close = pci_p2pdma_vma_close,
+};
+
+/**
+ * pci_mmap_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @vma: the userspace vma to map the memory to
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap;
+	unsigned long addr, pfn;
+	vm_fault_t ret;
+
+	/* prevent private mappings from being established */
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted private mapping\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
+	if (!pmap)
+		return -ENOMEM;
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_private_data = pmap;
+	vma->vm_ops = &pci_p2pdma_vmops;
+
+	pfn = virt_to_phys(pmap->kaddr) >> PAGE_SHIFT;
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+		ret = vmf_insert_mixed(vma, addr,
+				       __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP));
+		if (ret & VM_FAULT_ERROR)
+			goto out_error;
+		pfn++;
+	}
+
+	return 0;
+
+out_error:
+	kref_put(&pmap->ref, pci_p2pdma_map_free);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index fc5de47eeac4..26fe40363d1c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -40,6 +40,7 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -116,6 +117,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
+				  struct vm_area_struct *vma)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
  2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (13 preceding siblings ...)
  2020-11-06 17:00 ` [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem() Logan Gunthorpe
@ 2020-11-06 17:00 ` Logan Gunthorpe
  2020-11-06 17:39   ` Jason Gunthorpe
  2020-11-09 15:03   ` Keith Busch
  14 siblings, 2 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:00 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Logan Gunthorpe

Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory,
(the offset is ignored) so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.
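
For illustration, userspace usage would look roughly like the sketch
below. The device paths, buffer size and error handling are examples
only and not part of this patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	int ctrl_fd, blk_fd;
	void *cmb;

	ctrl_fd = open("/dev/nvme0", O_RDWR);		  /* controller char device */
	blk_fd = open("/dev/nvme1n1", O_RDWR | O_DIRECT); /* peer NVMe namespace */
	if (ctrl_fd < 0 || blk_fd < 0)
		exit(1);

	/* the offset is ignored; the kernel picks the location in the CMB */
	cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, ctrl_fd, 0);
	if (cmb == MAP_FAILED)
		exit(1);

	/* an O_DIRECT read lands in the CMB, triggering a P2PDMA transfer */
	if (pread(blk_fd, cmb, len, 0) < 0)
		exit(1);

	munmap(cmb, len);
	close(blk_fd);
	close(ctrl_fd);
	return 0;
}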

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c | 11 +++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  9 +++++++++
 3 files changed, 21 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f14316c9b34a..fc642aba671d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3240,12 +3240,23 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
 	}
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct nvme_ctrl *ctrl = file->private_data;
+
+	if (!ctrl->ops->mmap_cmb)
+		return -ENODEV;
+
+	return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
 	.owner		= THIS_MODULE,
 	.open		= nvme_dev_open,
 	.release	= nvme_dev_release,
 	.unlocked_ioctl	= nvme_dev_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
+	.mmap		= nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a0bfe8709cfc..3d790d849701 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -473,6 +473,7 @@ struct nvme_ctrl_ops {
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+	int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 #ifdef CONFIG_FAULT_INJECTION_DEBUG_FS
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26976bdf4af0..9c61573111ea 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2687,6 +2687,14 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
 	return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+			     struct vm_area_struct *vma)
+{
+	struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+	return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
@@ -2698,6 +2706,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
 	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
+	.mmap_cmb		= nvme_pci_mmap_cmb,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:00 ` [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem() Logan Gunthorpe
@ 2020-11-06 17:22   ` Jason Gunthorpe
  2020-11-06 17:28     ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 17:22 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> a hunk of p2pmem into userspace.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h |   6 +++
>  2 files changed, 110 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 9961e779f430..8eab53ac59ae 100644
> +++ b/drivers/pci/p2pdma.c
> @@ -16,6 +16,7 @@
>  #include <linux/genalloc.h>
>  #include <linux/memremap.h>
>  #include <linux/percpu-refcount.h>
> +#include <linux/pfn_t.h>
>  #include <linux/random.h>
>  #include <linux/seq_buf.h>
>  #include <linux/xarray.h>
> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> +
> +struct pci_p2pdma_map {
> +	struct kref ref;
> +	struct pci_dev *pdev;
> +	void *kaddr;
> +	size_t len;
> +};

Why have this at all? Nothing uses it and no vm_operations ops are
implemented?

This is very inflexible, it would be better if this is designed like
io_remap_pfn where it just preps and fills the VMA, doesn't take
ownership of the entire VMA

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:22   ` Jason Gunthorpe
@ 2020-11-06 17:28     ` Logan Gunthorpe
  2020-11-06 17:42       ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
>> a hunk of p2pmem into userspace.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>  drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
>>  include/linux/pci-p2pdma.h |   6 +++
>>  2 files changed, 110 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 9961e779f430..8eab53ac59ae 100644
>> +++ b/drivers/pci/p2pdma.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/genalloc.h>
>>  #include <linux/memremap.h>
>>  #include <linux/percpu-refcount.h>
>> +#include <linux/pfn_t.h>
>>  #include <linux/random.h>
>>  #include <linux/seq_buf.h>
>>  #include <linux/xarray.h>
>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
>>  }
>>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
>> +
>> +struct pci_p2pdma_map {
>> +	struct kref ref;
>> +	struct pci_dev *pdev;
>> +	void *kaddr;
>> +	size_t len;
>> +};
> 
> Why have this at all? Nothing uses it and no vm_operations ops are
> implemented?

It's necessary to free the allocated p2pmem when the mapping is torn down.

> This is very inflexible, it would be better if this is designed like
> io_remap_pfn where it just preps and fills the VMA, doesn't take
> ownership of the entire VMA

If someone wants to manage their own P2P memory and create their own
userspace mapping with vmf_insert_mixed they are free to do so. But this
helper is specifically for users of pci_p2pdma_map_alloc(). I know you
don't intend to use that but it doesn't make it any less valid.

Logan


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
  2020-11-06 17:00 ` [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace Logan Gunthorpe
@ 2020-11-06 17:39   ` Jason Gunthorpe
  2020-11-06 17:43     ` Logan Gunthorpe
  2020-11-09 15:03   ` Keith Busch
  1 sibling, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 17:39 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote:
> Allow userspace to obtain CMB memory by mmaping the controller's
> char device. The mmap call allocates and returns a hunk of CMB memory,
> (the offset is ignored) so userspace does not have control over the
> address within the CMB.
> 
> A VMA allocated in this way will only be usable by drivers that set
> FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
> checked the first time the pages are mapped for DMA.
> 
> Currently this is only supported by O_DIRECT to a PCI NVMe device
> or through the NVMe passthrough IOCTL.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  drivers/nvme/host/core.c | 11 +++++++++++
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  |  9 +++++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f14316c9b34a..fc642aba671d 100644
> +++ b/drivers/nvme/host/core.c
> @@ -3240,12 +3240,23 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>  	}
>  }
>  
> +static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct nvme_ctrl *ctrl = file->private_data;
> +
> +	if (!ctrl->ops->mmap_cmb)
> +		return -ENODEV;
> +
> +	return ctrl->ops->mmap_cmb(ctrl, vma);
> +}

This needs to ensure that the VMA created is destroyed before the
driver is unprobed - ie the struct pages backing the BAR memory is
destroyed.

I don't see anything that synchronizes this in the nvme_dev_release()?

Many places do this by putting all the VMAs into an address space and
zaping the address space when unprobing the driver to revoke the
pages, but there is a tricky race here :\

https://lore.kernel.org/dri-devel/20201021125030.GK36674@ziepe.ca/

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:28     ` Logan Gunthorpe
@ 2020-11-06 17:42       ` Jason Gunthorpe
  2020-11-06 17:53         ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 17:42 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> >> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> >> a hunk of p2pmem into userspace.
> >>
> >> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> >>  drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
> >>  include/linux/pci-p2pdma.h |   6 +++
> >>  2 files changed, 110 insertions(+)
> >>
> >> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >> index 9961e779f430..8eab53ac59ae 100644
> >> +++ b/drivers/pci/p2pdma.c
> >> @@ -16,6 +16,7 @@
> >>  #include <linux/genalloc.h>
> >>  #include <linux/memremap.h>
> >>  #include <linux/percpu-refcount.h>
> >> +#include <linux/pfn_t.h>
> >>  #include <linux/random.h>
> >>  #include <linux/seq_buf.h>
> >>  #include <linux/xarray.h>
> >> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> >>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
> >>  }
> >>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> >> +
> >> +struct pci_p2pdma_map {
> >> +	struct kref ref;
> >> +	struct pci_dev *pdev;
> >> +	void *kaddr;
> >> +	size_t len;
> >> +};
> > 
> > Why have this at all? Nothing uses it and no vm_operations ops are
> > implemented?
> 
> It's necessary to free the allocated p2pmem when the mapping is torn down.

That's suspicious.. Once in a VMA the lifetime of the page must be
controlled by the page refcount, it can't be put back into the genpool
just because the vma was destroyed.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
  2020-11-06 17:39   ` Jason Gunthorpe
@ 2020-11-06 17:43     ` Logan Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter




On 2020-11-06 10:39 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote:
>> Allow userspace to obtain CMB memory by mmaping the controller's
>> char device. The mmap call allocates and returns a hunk of CMB memory,
>> (the offset is ignored) so userspace does not have control over the
>> address within the CMB.
>>
>> A VMA allocated in this way will only be usable by drivers that set
>> FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
>> checked the first time the pages are mapped for DMA.
>>
>> Currently this is only supported by O_DIRECT to a PCI NVMe device
>> or through the NVMe passthrough IOCTL.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>  drivers/nvme/host/core.c | 11 +++++++++++
>>  drivers/nvme/host/nvme.h |  1 +
>>  drivers/nvme/host/pci.c  |  9 +++++++++
>>  3 files changed, 21 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f14316c9b34a..fc642aba671d 100644
>> +++ b/drivers/nvme/host/core.c
>> @@ -3240,12 +3240,23 @@ static long nvme_dev_ioctl(struct file *file, unsigned int cmd,
>>  	}
>>  }
>>  
>> +static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	struct nvme_ctrl *ctrl = file->private_data;
>> +
>> +	if (!ctrl->ops->mmap_cmb)
>> +		return -ENODEV;
>> +
>> +	return ctrl->ops->mmap_cmb(ctrl, vma);
>> +}
> 
> This needs to ensure that the VMA created is destroyed before the
> driver is unprobed - ie the struct pages backing the BAR memory is
> destroyed.
> 
> I don't see anything that synchronizes this in the nvme_dev_release()?

Yup, looks like something that needs to be fixed. Though I'd probably do
it in the pci_p2pdma helper code instead.

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:42       ` Jason Gunthorpe
@ 2020-11-06 17:53         ` Logan Gunthorpe
  2020-11-06 18:09           ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 17:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-06 10:42 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
>>> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
>>>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
>>>> a hunk of p2pmem into userspace.
>>>>
>>>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>>>  drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
>>>>  include/linux/pci-p2pdma.h |   6 +++
>>>>  2 files changed, 110 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>>>> index 9961e779f430..8eab53ac59ae 100644
>>>> +++ b/drivers/pci/p2pdma.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/genalloc.h>
>>>>  #include <linux/memremap.h>
>>>>  #include <linux/percpu-refcount.h>
>>>> +#include <linux/pfn_t.h>
>>>>  #include <linux/random.h>
>>>>  #include <linux/seq_buf.h>
>>>>  #include <linux/xarray.h>
>>>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>>>>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
>>>> +
>>>> +struct pci_p2pdma_map {
>>>> +	struct kref ref;
>>>> +	struct pci_dev *pdev;
>>>> +	void *kaddr;
>>>> +	size_t len;
>>>> +};
>>>
>>> Why have this at all? Nothing uses it and no vm_operations ops are
>>> implemented?
>>
>> It's necessary to free the allocated p2pmem when the mapping is torn down.
> 
> That's suspicious.. Once in a VMA the lifetime of the page must be
> controlled by the page refcount, it can't be put back into the genpool
> just because the vma was destroyed.

Ah, hmm, yes. I guess the pages have to be hooked and returned to the
genalloc through free_devmap_managed_page(). Seems like it might be
doable... but it will complicate things for users that don't want to use
the genpool (though no such users exist upstream).

Logan


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 17:53         ` Logan Gunthorpe
@ 2020-11-06 18:09           ` Jason Gunthorpe
  2020-11-06 18:20             ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 18:09 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:53:45AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2020-11-06 10:42 a.m., Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> >>>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> >>>> a hunk of p2pmem into userspace.
> >>>>
> >>>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> >>>>  drivers/pci/p2pdma.c       | 104 +++++++++++++++++++++++++++++++++++++
> >>>>  include/linux/pci-p2pdma.h |   6 +++
> >>>>  2 files changed, 110 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >>>> index 9961e779f430..8eab53ac59ae 100644
> >>>> +++ b/drivers/pci/p2pdma.c
> >>>> @@ -16,6 +16,7 @@
> >>>>  #include <linux/genalloc.h>
> >>>>  #include <linux/memremap.h>
> >>>>  #include <linux/percpu-refcount.h>
> >>>> +#include <linux/pfn_t.h>
> >>>>  #include <linux/random.h>
> >>>>  #include <linux/seq_buf.h>
> >>>>  #include <linux/xarray.h>
> >>>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> >>>>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> >>>> +
> >>>> +struct pci_p2pdma_map {
> >>>> +	struct kref ref;
> >>>> +	struct pci_dev *pdev;
> >>>> +	void *kaddr;
> >>>> +	size_t len;
> >>>> +};
> >>>
> >>> Why have this at all? Nothing uses it and no vm_operations ops are
> >>> implemented?
> >>
> >> It's necessary to free the allocated p2pmem when the mapping is torn down.
> > 
> > That's suspicious.. Once in a VMA the lifetime of the page must be
> > controlled by the page refcount, it can't be put back into the genpool
> > just because the vma was destroed.
> 
> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
> genalloc through free_devmap_managed_page(). 

That sounds about right, but in this case it doesn't need the VMA
operations.

> Seems like it might be doable... but it will complicate things for
> users that don't want to use the genpool (though no such users exist
> upstream).

I would like to use this stuff in RDMA pretty much immediately and the
genpool is harmful for those cases, so please don't make decisions
that are tying things to the genpool

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 18:09           ` Jason Gunthorpe
@ 2020-11-06 18:20             ` Logan Gunthorpe
  2020-11-06 19:30               ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-06 11:09 a.m., Jason Gunthorpe wrote:
>> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
>> genalloc through free_devmap_managed_page(). 
> 
> That sounds about right, but in this case it doesn't need the VMA
> operations.
> 
>> Seems like it might be doable... but it will complicate things for
>> users that don't want to use the genpool (though no such users exist
>> upstream).
> 
> I would like to use this stuff in RDMA pretty much immediately and the
> genpool is harmful for those cases, so please don't make decisions
> > that are tying things to the genpool

I certainly can't make decisions for code that isn't currently upstream.
So you will almost certainly have to make changes for the code you want
to add, as is the standard procedure. I can't and should not export APIs
that you might need that have no upstream users, but you are certainly
free to send patches that create them when you add the use case.

Ultimately, if you aren't using the genpool you will have to implement
your own mmap operation that somehow allocates the pages and your own
page_free hook. The latter can be accommodated by a patch that
splits off pci_p2pdma_add_resource() into a function that doesn't use
the genpool (I've already seen two independent developers create a
similar patch for this, but with no upstream user they couldn't be taken
upstream).
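
Roughly, such a split might end up looking like this (hypothetical
signature, no such function exists today):

/* like pci_p2pdma_add_resource(), but skips the genpool setup so the
 * caller manages the pages itself; returns the kernel mapping */
void *pci_p2pdma_add_resource_nopool(struct pci_dev *pdev, int bar,
				     size_t size, u64 offset);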

I also don't expect this to be going upstream in the near term so don't
get too excited about using it.

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 18:20             ` Logan Gunthorpe
@ 2020-11-06 19:30               ` Jason Gunthorpe
  2020-11-06 19:44                 ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 19:30 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 11:20:05AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2020-11-06 11:09 a.m., Jason Gunthorpe wrote:
> >> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
> >> genalloc through free_devmap_managed_page(). 
> > 
> > That sounds about right, but in this case it doesn't need the VMA
> > operations.
> > 
> >> Seems like it might be doable... but it will complicate things for
> >> users that don't want to use the genpool (though no such users exist
> >> upstream).
> > 
> > I would like to use this stuff in RDMA pretty much immediately and the
> > genpool is harmful for those cases, so please don't make decisions
> > > that are tying things to the genpool
> 
> I certainly can't make decisions for code that isn't currently
> upstream.

The rdma drivers are all upstream, what are you thinking about?

> Ultimately, if you aren't using the genpool you will have to implement
> your own mmap operation that somehow allocates the pages and your own
> page_free hook. 

Sure, the mlx5 driver already has a specialized allocator for its BAR
pages.

> I also don't expect this to be going upstream in the near term so don't
> get too excited about using it.

I don't know, it is actually not that horrible, the GUP and IOMMU
related changes are simpler than I expected

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 19:30               ` Jason Gunthorpe
@ 2020-11-06 19:44                 ` Logan Gunthorpe
  2020-11-06 19:53                   ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 19:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
>> I certainly can't make decisions for code that isn't currently
>> upstream.
> 
> The rdma drivers are all upstream, what are you thinking about?

Really? I feel like you should know what I mean here...

I mean upstream code that actually uses the APIs that I'd have to
introduce. I can't say here's an API feature that no code uses but the
already upstream rdma driver might use eventually. It's fairly easy to
send patches that make the necessary changes when someone adds a use of
those changes inside the rdma code.

>> Ultimately, if you aren't using the genpool you will have to implement
>> your own mmap operation that somehow allocates the pages and your own
>> page_free hook. 
> 
> Sure, the mlx5 driver already has a specialized allocator for its BAR
> pages.

So it *might* make sense to carve out a common helper to setup a VMA for
P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
there's no code that would be common to the two cases.

>> I also don't expect this to be going upstream in the near term so don't
>> get too excited about using it.
> 
> I don't know, it is actually not that horrible, the GUP and IOMMU
> related changes are simpler than I expected

I think the deal breaker is the SGL hack and the fact that there are
important IOMMU implementations that won't have support.

Logan


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 19:44                 ` Logan Gunthorpe
@ 2020-11-06 19:53                   ` Jason Gunthorpe
  2020-11-06 20:03                     ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-06 19:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 12:44:59PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
> >> I certainly can't make decisions for code that isn't currently
> >> upstream.
> > 
> > The rdma drivers are all upstream, what are you thinking about?
> 
> Really? I feel like you should know what I mean here...
> 
> I mean upstream code that actually uses the APIs that I'd have to
> introduce. I can't say here's an API feature that no code uses but the
> already upstream rdma driver might use eventually. It's fairly easy to
> send patches that make the necessary changes when someone adds a use of
> those changes inside the rdma code.

Sure, but that doesn't mean you have to actively design things to be
unusable beyond this narrow case. The RDMA drivers are all there, all
upstream, if this work is accepted then the changes to insert P2P
pages into their existing mmap flows is a reasonable usecase to
consider at this point when building core code APIs.

You shouldn't add dead code, but at least have a design in mind for
what it needs to look like and some allowance.

> >> Ultimately, if you aren't using the genpool you will have to implement
> >> your own mmap operation that somehow allocates the pages and your own
> >> page_free hook. 
> > 
> > Sure, the mlx5 driver already has a specialized allocator for its BAR
> > pages.
> 
> So it *might* make sense to carve out a common helper to setup a VMA for
> P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
> there's no code that would be common to the two cases.

I think the whole insertion of P2PDMA pages into a VMA should be
similar to io_remap_pfn() so all the VM flags, pgprot_decrypted and
other subtle details are all in one place. (also it needs a
pgprot_decrypted before doing vmf_insert, I just learned that this
month)
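
Roughly something like the below (helper name and exact checks are
illustrative only):

/* hypothetical common setup helper for P2PDMA mmap implementations */
static int pci_p2pdma_vma_setup(struct vm_area_struct *vma)
{
	/* refuse private (COW) mappings, as pci_mmap_p2pmem() does */
	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE)
		return -EINVAL;

	vma->vm_flags |= VM_MIXEDMAP;
	/* needed before vmf_insert_mixed() on memory-encrypted systems */
	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

	return 0;
}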

> >> I also don't expect this to be going upstream in the near term so don't
> >> get too excited about using it.
> > 
> > I don't know, it is actually not that horrible, the GUP and IOMMU
> > related changes are simpler than I expected
> 
> I think the deal breaker is the SGL hack and the fact that there are
> important IOMMU implementations that won't have support.

Yes, that is pretty hacky, maybe someone will have a better idea

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 19:53                   ` Jason Gunthorpe
@ 2020-11-06 20:03                     ` Logan Gunthorpe
  2020-11-07  0:14                       ` Jason Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-06 20:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter




On 2020-11-06 12:53 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 12:44:59PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
>>>> I certainly can't make decisions for code that isn't currently
>>>> upstream.
>>>
>>> The rdma drivers are all upstream, what are you thinking about?
>>
>> Really? I feel like you should know what I mean here...
>>
>> I mean upstream code that actually uses the APIs that I'd have to
>> introduce. I can't say here's an API feature that no code uses but the
>> already upstream rdma driver might use eventually. It's fairly easy to
>> send patches that make the necessary changes when someone adds a use of
>> those changes inside the rdma code.
> 
> Sure, but that doesn't mean you have to actively design things to be
> unusable beyond this narrow case. The RDMA drivers are all there, all
> upstream, if this work is accepted then the changes to insert P2P
> pages into their existing mmap flows is a reasonable usecase to
> consider at this point when building core code APIs.
> 
> You shouldn't add dead code, but at least have a design in mind for
> what it needs to look like and some allowance.

I don't see anything I've done that's at odds with that. You will still
need to make changes to the p2pdma code to implement your use case.

>>>> Ultimately, if you aren't using the genpool you will have to implement
>>>> your own mmap operation that somehow allocates the pages and your own
>>>> page_free hook. 
>>>
>>> Sure, the mlx5 driver already has a specialized allocator for its BAR
>>> pages.
>>
>> So it *might* make sense to carve out a common helper to setup a VMA for
>> P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
>> there's no code that would be common to the two cases.
> 
> I think the whole insertion of P2PDMA pages into a VMA should be
> similar to io_remap_pfn() so all the VM flags, pgprot_decrypted and
> other subtle details are all in one place. (also it needs a
> pgprot_decrypted before doing vmf_insert, I just learned that this
> month)

I don't think a function like that will work for the p2pmem use case. In
order to implement proper page freeing I expect I'll need a loop around
the allocator and vm_insert_mixed()... Something roughly like:

for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
        vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
	ret = vmf_insert_mixed(vma, addr,
  		       __pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
}

That way we can call pci_free_p2pmem() when a page's ref count goes to
zero. I suspect your use case will need to do something similar.
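
The freeing side might then look roughly like this, assuming the
p2pdma pagemap grows a page_free op (sketch only):

static void pci_p2pdma_page_free(struct page *page)
{
	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);

	/* hand the page back to the genpool once its refcount drops */
	pci_free_p2pmem(p2p_pgmap->provider, page_to_virt(page), PAGE_SIZE);
}

static const struct dev_pagemap_ops pci_p2pdma_pagemap_ops = {
	.page_free	= pci_p2pdma_page_free,
};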

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-06 20:03                     ` Logan Gunthorpe
@ 2020-11-07  0:14                       ` Jason Gunthorpe
  2020-11-07  2:50                         ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2020-11-07  0:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 01:03:26PM -0700, Logan Gunthorpe wrote:
> I don't think a function like that will work for the p2pmem use case. In
> order to implement proper page freeing I expect I'll need a loop around
> the allocator and vm_insert_mixed()... Something roughly like:
> 
> for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>         vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
> 	ret = vmf_insert_mixed(vma, addr,
>   		       __pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
> }
> 
> That way we can call pci_free_p2pmem() when a page's ref count goes to
> zero. I suspect your use case will need to do something similar.

Yes, but I would say the pci_alloc_p2pmem() layer should be able to
free pages on a page-by-page basis so you don't have to do stuff like
the above.

There is often a lot of value in having physically contiguous addresses,
so allocating page by page as well seems poor.

Jason

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  2020-11-07  0:14                       ` Jason Gunthorpe
@ 2020-11-07  2:50                         ` Logan Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-07  2:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-06 5:14 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 01:03:26PM -0700, Logan Gunthorpe wrote:
>> I don't think a function like that will work for the p2pmem use case. In
>> order to implement proper page freeing I expect I'll need a loop around
>> the allocator and vm_insert_mixed()... Something roughly like:
>>
>> for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>>         vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
>> 	ret = vmf_insert_mixed(vma, addr,
>>   		       __pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
>> }
>>
>> That way we can call pci_free_p2pmem() when a page's ref count goes to
>> zero. I suspect your use case will need to do something similar.
> 
> Yes, but I would say the pci_alloc_p2pmem() layer should be able to
> free pages on a page-by-page basis so you don't have to do stuff like
> the above.
> 
> There is often a lot of value in having physically contiguous addresses,
> so allocating page by page as well seems poor.

Agreed. But I'll have to dig to see if genalloc supports freeing blocks
in different sizes than the allocations.

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn()
  2020-11-06 17:00 ` [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn() Logan Gunthorpe
@ 2020-11-09  9:12   ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2020-11-09  9:12 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:22AM -0700, Logan Gunthorpe wrote:
> In order to call this function from a dma_map function, it must not sleep.
> The only reason it does sleep is to allocate the seqbuf to print
> which devices are within the ACS path.
> 
> Switch the kmalloc call to use GFP_NOWAIT and simply not print that
> message if the buffer fails to be allocated.

Please pass in the actual gfp_t.  Especially from an I/O path
GFP_NOWAIT is not the right gfp_t anyway, you probably want GFP_ATOMIC
there.  But also for the path where we can sleep we should allow that.
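
I.e. thread a gfp_t down from the map path, roughly (sketch only,
parameter list abridged):

static enum pci_p2pdma_map_type
upstream_bridge_distance_warn(..., gfp_t gfp)
{
	...
	seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp), PAGE_SIZE);
	...
}

with sleepable callers passing GFP_KERNEL and the dma_map path passing
GFP_ATOMIC.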

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-11-06 17:00 ` [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
@ 2020-11-09  9:12   ` Christoph Hellwig
  2020-11-09 14:02     ` Robin Murphy
  2020-11-09 16:47     ` Logan Gunthorpe
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2020-11-09  9:12 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> We make use of the top bit of the dma_length to indicate a P2PDMA
> segment.

I don't think "we" can.  There is nothing limiting the size of a SGL
segment.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-11-09  9:12   ` Christoph Hellwig
@ 2020-11-09 14:02     ` Robin Murphy
  2020-11-09 16:47     ` Logan Gunthorpe
  1 sibling, 0 replies; 42+ messages in thread
From: Robin Murphy @ 2020-11-09 14:02 UTC (permalink / raw)
  To: Christoph Hellwig, Logan Gunthorpe
  Cc: Matthew Wilcox, Jason Gunthorpe, linux-pci, Daniel Vetter,
	Ira Weiny, linux-kernel, linux-nvme, Stephen Bates, linux-block,
	linux-mm, iommu, Christian König, John Hubbard,
	Dan Williams

On 2020-11-09 09:12, Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>> We make use of the top bit of the dma_length to indicate a P2PDMA
>> segment.
> 
> I don't think "we" can.  There is nothing limiting the size of a SGL
> segment.

Right, the story behind ab2cbeb0ed30 ("iommu/dma: Handle SG length 
overflow better") comes immediately to mind, for one. If all the P2P 
users can agree to be in on the game then by all means implement this in 
the P2P code, but I don't think it belongs in the generic top-level 
scatterlist API.

Robin.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
  2020-11-06 17:00 ` [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace Logan Gunthorpe
  2020-11-06 17:39   ` Jason Gunthorpe
@ 2020-11-09 15:03   ` Keith Busch
  2020-11-09 16:50     ` Logan Gunthorpe
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2020-11-09 15:03 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote:
> Allow userspace to obtain CMB memory by mmaping the controller's
> char device. The mmap call allocates and returns a hunk of CMB memory,
> (the offset is ignored) so userspace does not have control over the
> address within the CMB.
> 
> A VMA allocated in this way will only be usable by drivers that set
> FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
> checked the first time the pages are mapped for DMA.
> 
> Currently this is only supported by O_DIRECT to a PCI NVMe device
> or through the NVMe passthrough IOCTL.

Rather than make this be specific to nvme, could pci p2pdma create an
mmap'able file for any resource registered with it?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-11-09  9:12   ` Christoph Hellwig
  2020-11-09 14:02     ` Robin Murphy
@ 2020-11-09 16:47     ` Logan Gunthorpe
  2020-12-10  1:22       ` Dan Williams
  1 sibling, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-09 16:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Dan Williams, Jason Gunthorpe,
	Christian König, Ira Weiny, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter



On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>> We make use of the top bit of the dma_length to indicate a P2PDMA
>> segment.
> 
> I don't think "we" can.  There is nothing limiting the size of a SGL
> segment.

Yes, I expected this would be the unacceptable part. Any alternative ideas?

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace
  2020-11-09 15:03   ` Keith Busch
@ 2020-11-09 16:50     ` Logan Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-09 16:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter



On 2020-11-09 8:03 a.m., Keith Busch wrote:
> On Fri, Nov 06, 2020 at 10:00:36AM -0700, Logan Gunthorpe wrote:
>> Allow userspace to obtain CMB memory by mmaping the controller's
>> char device. The mmap call allocates and returns a hunk of CMB memory,
>> (the offset is ignored) so userspace does not have control over the
>> address within the CMB.
>>
>> A VMA allocated in this way will only be usable by drivers that set
>> FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
>> checked the first time the pages are mapped for DMA.
>>
>> Currently this is only supported by O_DIRECT to a PCI NVMe device
>> or through the NVMe passthrough IOCTL.
> 
> Rather than make this be specific to nvme, could pci p2pdma create an
> mmap'able file for any resource registered with it?

It's certainly possible. However, other people have been arguing that
more of this should be specific to NVMe as some use cases do not want to
use the genalloc inside p2pdma.

Logan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2020-11-06 17:00 ` [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
@ 2020-11-10 23:25   ` Bjorn Helgaas
  2020-11-10 23:42     ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Bjorn Helgaas @ 2020-11-10 23:25 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter

On Fri, Nov 06, 2020 at 10:00:24AM -0700, Logan Gunthorpe wrote:
> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
> dma map functions to determine how to map a given p2pdma page.

s/dma/DMA/ for consistency (also below in function comment)

> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
> offset if they need to map the bus address.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c       | 46 ++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h | 11 +++++++++
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ea8472278b11..9961e779f430 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>  
> +/**
> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
> + * @page: page to get the offset for
> + *
> + * Must be passed a pci p2pdma page.

s/pci/PCI/

> + */
> +u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
> +
> +	WARN_ON(!is_pci_p2pdma_page(page));
> +
> +	return p2p_pgmap->bus_offset;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
> +
> +/**
> + * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
> + *	bus address
> + * @dev: device doing the DMA request
> + * @pgmap: dev_pagemap structure for the mapping
> + *
> + * Returns 1 if the page should be mapped with a bus address, 0 otherwise
> + * and -1 the device should not be mapping P2PDMA pages.

I think this is missing a word.

I'm not really sure how to interpret the "should" in
pci_p2pdma_should_map_bus().  If this returns -1, does that mean the
pages *cannot* be mapped?  They *could* be mapped, but you really
*shouldn't*?  Something else?

1 means page should be mapped with bus address.  0 means ... what,
exactly?  It should be mapped with some different address?

Sorry these are naive questions because I don't know how all this
works.

> + */
> +int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap)
> +{
> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
> +	struct pci_dev *client;
> +
> +	if (!dev_is_pci(dev))
> +		return -1;
> +
> +	client = to_pci_dev(dev);
> +
> +	switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +		return 0;
> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> +		return 1;
> +	default:
> +		return -1;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_should_map_bus);
> +
>  /**
>   * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
>   *		to enable p2pdma
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 8318a97c9c61..fc5de47eeac4 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -34,6 +34,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  		int nents, enum dma_data_direction dir, unsigned long attrs);
>  void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  		int nents, enum dma_data_direction dir, unsigned long attrs);
> +u64 pci_p2pdma_bus_offset(struct page *page);
> +int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap);
>  int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
>  			    bool *use_p2pdma);
>  ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> @@ -83,6 +85,15 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
>  static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  {
>  }
> +static inline u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> +	return -1;
> +}
> +static inline int pci_p2pdma_should_map_bus(struct device *dev,
> +					    struct dev_pagemap *pgmap)
> +{
> +	return -1;
> +}
>  static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
>  		struct scatterlist *sg, int nents, enum dma_data_direction dir,
>  		unsigned long attrs)
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset()
  2020-11-10 23:25   ` Bjorn Helgaas
@ 2020-11-10 23:42     ` Logan Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-11-10 23:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, Ira Weiny, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter



On 2020-11-10 4:25 p.m., Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:00:24AM -0700, Logan Gunthorpe wrote:
>> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
>> dma map functions to determine how to map a given p2pdma page.
> 
> s/dma/DMA/ for consistency (also below in function comment)
> 
>> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
>> offset if they need to map the bus address.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>  drivers/pci/p2pdma.c       | 46 ++++++++++++++++++++++++++++++++++++++
>>  include/linux/pci-p2pdma.h | 11 +++++++++
>>  2 files changed, 57 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ea8472278b11..9961e779f430 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>>  }
>>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>  
>> +/**
>> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
>> + * @page: page to get the offset for
>> + *
>> + * Must be passed a pci p2pdma page.
> 
> s/pci/PCI/
> 
>> + */
>> +u64 pci_p2pdma_bus_offset(struct page *page)
>> +{
>> +	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
>> +
>> +	WARN_ON(!is_pci_p2pdma_page(page));
>> +
>> +	return p2p_pgmap->bus_offset;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
>> +
>> +/**
>> + * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
>> + *	bus address
>> + * @dev: device doing the DMA request
>> + * @pgmap: dev_pagemap structure for the mapping
>> + *
>> + * Returns 1 if the page should be mapped with a bus address, 0 otherwise
>> + * and -1 the device should not be mapping P2PDMA pages.
> 
> I think this is missing a word.
> 
> I'm not really sure how to interpret the "should" in
> pci_p2pdma_should_map_bus().  If this returns -1, does that mean the
> pages *cannot* be mapped?  They *could* be mapped, but you really
> *shouldn't*?  Something else?
> 
> 1 means page should be mapped with bus address.  0 means ... what,
> exactly?  It should be mapped with some different address?

1 means it must be mapped with a bus address.
0 means it may be mapped normally (through the IOMMU or just with a
direct physical address).
-1 means it cannot be mapped and the mapping should fail (i.e. it would
have to go through the IOMMU, but the IOMMU is not in the whitelist).
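
Something along these lines is what a dma_map_sg() implementation could
do with the three values -- sketch only, not code from the series; the
function name and the bus-address arithmetic here are assumptions on my
part:

	#include <linux/mm.h>
	#include <linux/pci-p2pdma.h>
	#include <linux/scatterlist.h>
	#include <linux/errno.h>

	/* Returns 1 if the segment was mapped as a bus address, 0 if the
	 * caller should map it normally, negative errno if P2PDMA is not
	 * possible for this device/provider pair. */
	static int p2pdma_try_bus_map(struct device *dev, struct scatterlist *sg)
	{
		struct page *page = sg_page(sg);

		if (!is_pci_p2pdma_page(page))
			return 0;	/* ordinary memory: normal mapping path */

		switch (pci_p2pdma_should_map_bus(dev, page->pgmap)) {
		case 1:		/* must bypass the IOMMU: use the bus address */
			sg->dma_address = sg_phys(sg) +
					  pci_p2pdma_bus_offset(page);
			sg_dma_len(sg) = sg->length;
			return 1;
		case 0:		/* may be mapped like any other page */
			return 0;
		default:	/* -1: this pairing cannot do P2PDMA */
			return -EREMOTEIO;
		}
	}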

> Sorry these are naive questions because I don't know how all this
> works.

Thanks for the review. Definitely points out some questionable language
that I used. I'll reword this if/when it goes further.

Logan


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-11-09 16:47     ` Logan Gunthorpe
@ 2020-12-10  1:22       ` Dan Williams
  2020-12-10  2:06         ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2020-12-10  1:22 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, linux-nvme,
	linux-block, Linux PCI, Linux MM, open list:DMA MAPPING HELPERS,
	Stephen Bates, Jason Gunthorpe, Christian König, Ira Weiny,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter

On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> > On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >> We make use of the top bit of the dma_length to indicate a P2PDMA
> >> segment.
> >
> > I don't think "we" can.  There is nothing limiting the size of a SGL
> > segment.
>
> Yes, I expected this would be the unacceptable part. Any alternative ideas?

Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
segment-pages for is_pci_p2pdma_page()?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-12-10  1:22       ` Dan Williams
@ 2020-12-10  2:06         ` Logan Gunthorpe
  2020-12-10  4:04           ` Dan Williams
  0 siblings, 1 reply; 42+ messages in thread
From: Logan Gunthorpe @ 2020-12-10  2:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Linux Kernel Mailing List, linux-nvme,
	linux-block, Linux PCI, Linux MM, open list:DMA MAPPING HELPERS,
	Stephen Bates, Jason Gunthorpe, Christian König, Ira Weiny,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter



On 2020-12-09 6:22 p.m., Dan Williams wrote:
> On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>>
>> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
>>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>>>> We make use of the top bit of the dma_length to indicate a P2PDMA
>>>> segment.
>>>
>>> I don't think "we" can.  There is nothing limiting the size of a SGL
>>> segment.
>>
>> Yes, I expected this would be the unacceptable part. Any alternative ideas?
> 
> Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> segment-pages for is_pci_p2pdma_page()?

Because the DMA and page segments in the SGL aren't necessarily aligned...

The IOMMU implementations can coalesce multiple pages into fewer DMA
address ranges, so the page pointed to by sg->page_link may not be the
one that corresponds to the address in sg->dma_address for a given segment.

If that makes sense -- it's not the easiest thing to explain.
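
To illustrate (sketch only; sg_dma_is_p2pdma() is a made-up accessor for
whatever per-segment flag we settle on): the unmap side has to decide,
for each DMA segment, whether dma_address is a PCI bus address or an
IOVA, and after coalescing sg_page() can't answer that reliably:

	/* Sketch: walk the DMA segments of a mapped SGL.  Checking
	 * is_pci_p2pdma_page(sg_page(sg)) here may inspect a page that has
	 * nothing to do with sg_dma_address(sg) once the IOMMU has coalesced
	 * pages, which is why the flag has to live on the DMA side. */
	static void unmap_sgl_sketch(struct device *dev, struct scatterlist *sgl,
				     int nents)
	{
		struct scatterlist *sg;
		int i;

		for_each_sg(sgl, sg, nents, i) {
			if (sg_dma_is_p2pdma(sg))	/* hypothetical */
				continue;	/* bus address: no IOVA to free */
			/* otherwise tear down the IOVA for this segment */
		}
	}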

Logan



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-12-10  2:06         ` Logan Gunthorpe
@ 2020-12-10  4:04           ` Dan Williams
  2020-12-10 16:44             ` Logan Gunthorpe
  0 siblings, 1 reply; 42+ messages in thread
From: Dan Williams @ 2020-12-10  4:04 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Linux Kernel Mailing List, linux-nvme,
	linux-block, Linux PCI, Linux MM, open list:DMA MAPPING HELPERS,
	Stephen Bates, Jason Gunthorpe, Christian König, Ira Weiny,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter

On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2020-12-09 6:22 p.m., Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >>
> >>
> >> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >>>> We make use of the top bit of the dma_length to indicate a P2PDMA
> >>>> segment.
> >>>
> >>> I don't think "we" can.  There is nothing limiting the size of a SGL
> >>> segment.
> >>
> >> Yes, I expected this would be the unacceptable part. Any alternative ideas?
> >
> > Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> > segment-pages for is_pci_p2pdma_page()?
>
> Because the DMA and page segments in the SGL aren't necessarily aligned...
>
> The IOMMU implementations can coalesce multiple pages into fewer DMA
> address ranges, so the page pointed to by sg->page_link may not be the
> one that corresponds to the address in sg->dma_address for a given segment.
>
> If that makes sense -- it's not the easiest thing to explain.

It does...

Did someone already grab, or did you already consider the 3rd
available bit in page_link? AFAICS only SG_CHAIN and SG_END are
reserved. However, if you have a CONFIG_64BIT dependency for
user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
(0x4) in page_link.
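
Something like this, purely as an illustration -- the flag name, helpers
and masking below are made up, not proposed code:

	/* Third low bit of page_link, beside SG_CHAIN (0x1) and SG_END (0x2).
	 * Only safe if struct page pointers are at least 8-byte aligned,
	 * hence the CONFIG_64BIT dependency. */
	#define SG_P2PDMA_FLAG	0x04UL

	static inline void sg_mark_p2pdma(struct scatterlist *sg)
	{
		sg->page_link |= SG_P2PDMA_FLAG;
	}

	static inline bool sg_is_p2pdma(struct scatterlist *sg)
	{
		return sg->page_link & SG_P2PDMA_FLAG;
	}

	/* sg_page(), sg_chain_ptr(), etc. would then have to mask this bit
	 * out along with SG_CHAIN and SG_END. */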

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
  2020-12-10  4:04           ` Dan Williams
@ 2020-12-10 16:44             ` Logan Gunthorpe
  0 siblings, 0 replies; 42+ messages in thread
From: Logan Gunthorpe @ 2020-12-10 16:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Linux Kernel Mailing List, linux-nvme,
	linux-block, Linux PCI, Linux MM, open list:DMA MAPPING HELPERS,
	Stephen Bates, Jason Gunthorpe, Christian König,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter



On 2020-12-09 9:04 p.m., Dan Williams wrote:
> On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>>
>> On 2020-12-09 6:22 p.m., Dan Williams wrote:
>>> On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
>>>>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>>>>>> We make use of the top bit of the dma_length to indicate a P2PDMA
>>>>>> segment.
>>>>>
>>>>> I don't think "we" can.  There is nothing limiting the size of a SGL
>>>>> segment.
>>>>
>>>> Yes, I expected this would be the unacceptable part. Any alternative ideas?
>>>
>>> Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
>>> segment-pages for is_pci_p2pdma_page()?
>>
>> Because the DMA and page segments in the SGL aren't necessarily aligned...
>>
>> The IOMMU implementations can coalesce multiple pages into fewer DMA
>> address ranges, so the page pointed to by sg->page_link may not be the
>> one that corresponds to the address in sg->dma_address for a given segment.
>>
>> If that makes sense -- it's not the easiest thing to explain.
> 
> It does...
> 
> Did someone already grab, or did you already consider the 3rd
> available bit in page_link? AFAICS only SG_CHAIN and SG_END are
> reserved. However, if you have a CONFIG_64BIT dependency for
> user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
> (0x4) in page_link.

Hmm, I half considered that, but I had come to the conclusion that,
given the misalignment, I shouldn't be using the page side of the SGL.
However, reconsidering now, that might actually be a reasonable option.

However, the CONFIG_64BIT dependency would then have to apply to all of
P2PDMA, because we'd need to replace pci_p2pdma_map_sg() in all cases.
I'm not sure whether that's a restriction people would care about.

Thanks,

Logan


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-12-10 16:45 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-06 17:00 [RFC PATCH 00/15] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 01/15] PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn() Logan Gunthorpe
2020-11-09  9:12   ` Christoph Hellwig
2020-11-06 17:00 ` [RFC PATCH 02/15] PCI/P2PDMA: Attempt to set map_type if it has not been set Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 03/15] PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and pci_p2pdma_bus_offset() Logan Gunthorpe
2020-11-10 23:25   ` Bjorn Helgaas
2020-11-10 23:42     ` Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 04/15] lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
2020-11-09  9:12   ` Christoph Hellwig
2020-11-09 14:02     ` Robin Murphy
2020-11-09 16:47     ` Logan Gunthorpe
2020-12-10  1:22       ` Dan Williams
2020-12-10  2:06         ` Logan Gunthorpe
2020-12-10  4:04           ` Dan Williams
2020-12-10 16:44             ` Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 05/15] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 06/15] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 07/15] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 08/15] nvme-pci: Check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 09/15] nvme-pci: Convert to using dma_map_sg for p2pdma pages Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 10/15] mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 11/15] iov_iter: Introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 12/15] block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 13/15] block: Set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 14/15] PCI/P2PDMA: Introduce pci_mmap_p2pmem() Logan Gunthorpe
2020-11-06 17:22   ` Jason Gunthorpe
2020-11-06 17:28     ` Logan Gunthorpe
2020-11-06 17:42       ` Jason Gunthorpe
2020-11-06 17:53         ` Logan Gunthorpe
2020-11-06 18:09           ` Jason Gunthorpe
2020-11-06 18:20             ` Logan Gunthorpe
2020-11-06 19:30               ` Jason Gunthorpe
2020-11-06 19:44                 ` Logan Gunthorpe
2020-11-06 19:53                   ` Jason Gunthorpe
2020-11-06 20:03                     ` Logan Gunthorpe
2020-11-07  0:14                       ` Jason Gunthorpe
2020-11-07  2:50                         ` Logan Gunthorpe
2020-11-06 17:00 ` [RFC PATCH 15/15] nvme-pci: Allow mmaping the CMB in userspace Logan Gunthorpe
2020-11-06 17:39   ` Jason Gunthorpe
2020-11-06 17:43     ` Logan Gunthorpe
2020-11-09 15:03   ` Keith Busch
2020-11-09 16:50     ` Logan Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).