linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
@ 2021-09-16 23:40 Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
                   ` (20 more replies)
  0 siblings, 21 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. My last posting[1] just included the first 13
patches in this series, but the early P2PDMA cleanup and map_sg error
changes from that series have been merged into v5.15-rc1. To address
concerns that that series did not add any new functionality, I've added
back the userspace functionality from the original RFC[2] (but improved
based on the original feedback).

The patchset enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 14, then Patches 15 through 17 wire this flag up based
on whether the block queue indicates P2PDMA support. Patches 18
through 20 enable the CMB to be mapped into userspace by mmap()ing
the nvme char device.
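
For illustration, the intended userspace flow looks roughly like the
sketch below. This is a hedged example: the char-device path, the
mmap() offset semantics, and the chunk size are assumptions based on
this cover letter rather than a documented ABI, and it obviously needs
an NVMe device with a CMB plus this series applied:

```c
/* Sketch only: requires an NVMe CMB and this patch series; the path
 * and offset semantics here are assumptions, not a stable ABI. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int chr = open("/dev/nvme0", O_RDWR);	/* nvme char device */
	int dat = open("data", O_RDONLY | O_DIRECT);
	if (chr < 0 || dat < 0)
		return 1;

	/* mmap() an allocated chunk of the CMB; the resulting VMA is
	 * backed by P2PDMA pages and may only be passed to O_DIRECT
	 * IO on NVMe-backed files or block devices. */
	size_t len = 1 << 21;
	void *cmb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 chr, 0);
	if (cmb == MAP_FAILED)
		return 1;

	/* The O_DIRECT read DMAs straight into the CMB, peer to peer,
	 * without bouncing through host memory. */
	ssize_t n = pread(dat, cmb, len, 0);
	printf("read %zd bytes into the CMB\n", n);

	munmap(cmb, len);
	return 0;
}
```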

This is relatively straightforward; however, the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
enforce this rule would require a huge hack that I don't expect would
be all that palatable. So the first 13 patches add support for P2PDMA
pages to dma_map_sg[table]() in the dma-direct and dma-iommu
implementations. This covers systems without an IOMMU as well as those
with Intel and AMD IOMMUs. (Other IOMMU implementations, notably ARM
and PowerPC, would remain unsupported until they convert to
dma-iommu.)

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().
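
A caller of dma_map_sgtable() can then treat the new error code as a
permanent failure, distinct from a transient allocation failure. A
hedged sketch (the blk_status_t translation below is illustrative; the
exact codes a real driver would pick are an assumption, not taken from
this series):

```c
/* Illustrative error translation for a block driver using
 * dma_map_sgtable(); the chosen blk_status_t values are assumptions. */
static blk_status_t map_request_sg(struct device *dev,
				   struct sg_table *sgt,
				   enum dma_data_direction dir)
{
	int rc = dma_map_sgtable(dev, sgt, dir, 0);

	if (!rc)
		return BLK_STS_OK;
	if (rc == -EREMOTEIO)		/* can't reach the P2PDMA pages */
		return BLK_STS_TARGET;	/* permanent: fail, don't retry */
	if (rc == -ENOMEM)
		return BLK_STS_RESOURCE; /* transient: retry later */
	return BLK_STS_IOERR;
}
```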

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.

Feedback welcome.

This series is based on v5.15-rc1. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v3

Thanks,

Logan

[1] https://lkml.kernel.org/r/20210513223203.5542-1-logang@deltatee.com
[2] https://lkml.kernel.org/r/20201106170036.18713-1-logang@deltatee.com

--

Logan Gunthorpe (20):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: attempt to set map_type if it has not been set
  PCI/P2PDMA: make pci_p2pdma_map_type() non-static
  PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: convert to using dma_map_sgtable()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  RDMA/rw: use dma_map_sgtable()
  PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  mm: use custom page_free for P2PDMA pages
  PCI/P2PDMA: introduce pci_mmap_p2pmem()
  nvme-pci: allow mmaping the CMB in userspace

 block/bio.c                  |   8 +-
 block/blk-map.c              |   7 +-
 drivers/dax/super.c          |   7 +-
 drivers/infiniband/core/rw.c |  75 +++----
 drivers/iommu/dma-iommu.c    |  68 +++++-
 drivers/nvme/host/core.c     |  18 +-
 drivers/nvme/host/nvme.h     |   4 +-
 drivers/nvme/host/pci.c      |  98 +++++----
 drivers/nvme/target/rdma.c   |   2 +-
 drivers/pci/Kconfig          |   3 +-
 drivers/pci/p2pdma.c         | 402 +++++++++++++++++++++++++++++------
 include/linux/dma-map-ops.h  |  10 +
 include/linux/dma-mapping.h  |   5 +
 include/linux/memremap.h     |   4 +-
 include/linux/mm.h           |   2 +
 include/linux/pci-p2pdma.h   |  92 ++++++--
 include/linux/scatterlist.h  |  50 ++++-
 include/linux/uio.h          |  21 +-
 include/rdma/ib_verbs.h      |  30 +++
 include/uapi/linux/magic.h   |   1 +
 kernel/dma/direct.c          |  44 +++-
 kernel/dma/mapping.c         |  34 ++-
 lib/iov_iter.c               |  28 +--
 mm/gup.c                     |  28 ++-
 mm/huge_memory.c             |   8 +-
 mm/memory-failure.c          |   4 +-
 mm/memory_hotplug.c          |   2 +-
 mm/memremap.c                |  26 ++-
 28 files changed, 834 insertions(+), 247 deletions(-)


base-commit: 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f
--
2.30.2

^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 18:32   ` Jason Gunthorpe
                     ` (2 more replies)
  2021-09-16 23:40 ` [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set Logan Gunthorpe
                   ` (19 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as P2PDMA.

Using this bit requires adding an additional dependency on CONFIG_64BIT to
CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes and generally require
the extra address space for the memory BARs used in the transactions.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/pci/Kconfig         |  2 +-
 include/linux/scatterlist.h | 50 ++++++++++++++++++++++++++++++++++---
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..90b4bddb3300 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -163,7 +163,7 @@ config PCI_PASID
 
 config PCI_P2PDMA
 	bool "PCI peer-to-peer transfer support"
-	depends on ZONE_DEVICE
+	depends on ZONE_DEVICE && 64BIT
 	select GENERIC_ALLOCATOR
 	help
 	  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 266754a55327..e62b1cf6386f 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,6 +64,21 @@ struct sg_append_table {
 #define SG_CHAIN	0x01UL
 #define SG_END		0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a PCI
+ * bus address when doing P2PDMA.
+ * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
+ */
+
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_PCI_P2PDMA	0x04UL
+#else
+#define SG_DMA_PCI_P2PDMA	0x00UL
+#endif
+
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
@@ -72,7 +87,9 @@ struct sg_append_table {
 #define sg_is_chain(sg)		((sg)->page_link & SG_CHAIN)
 #define sg_is_last(sg)		((sg)->page_link & SG_END)
 #define sg_chain_ptr(sg)	\
-	((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
+	((struct scatterlist *)((sg)->page_link & ~SG_PAGE_LINK_MASK))
+
+#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
 
 /**
  * sg_assign_page - Assign a given page to an SG entry
@@ -86,13 +103,13 @@ struct sg_append_table {
  **/
 static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 {
-	unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
+	unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
 
 	/*
 	 * In order for the low bit stealing approach to work, pages
 	 * must be aligned at a 32-bit boundary as a minimum.
 	 */
-	BUG_ON((unsigned long) page & (SG_CHAIN | SG_END));
+	BUG_ON((unsigned long)page & SG_PAGE_LINK_MASK);
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
@@ -126,7 +143,7 @@ static inline struct page *sg_page(struct scatterlist *sg)
 #ifdef CONFIG_DEBUG_SG
 	BUG_ON(sg_is_chain(sg));
 #endif
-	return (struct page *)((sg)->page_link & ~(SG_CHAIN | SG_END));
+	return (struct page *)((sg)->page_link & ~SG_PAGE_LINK_MASK);
 }
 
 /**
@@ -228,6 +245,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
 	sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_pci_p2pdma - Mark the scatterlist entry for PCI p2pdma
+ * @sg:		SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a PCI bus address.
+ **/
+static inline void sg_dma_mark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link |= SG_DMA_PCI_P2PDMA;
+}
+
+/**
+ * sg_dma_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
+ * @sg:		SG entry
+ *
+ * Description:
+ *   Clears the PCI P2PDMA mark
+ **/
+static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
+{
+	sg->page_link &= ~SG_DMA_PCI_P2PDMA;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg:	     SG entry
-- 
2.30.2



* [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-27 18:50   ` Bjorn Helgaas
  2021-09-16 23:40 ` [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static Logan Gunthorpe
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages come from userspace then there's no way to ensure
the path was checked in advance.

With this change it's no longer invalid to call pci_p2pdma_map_sg()
before the mapping type is calculated, so drop the WARN_ON for that
case.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 50cdde3e9a8b..1192c465ba6d 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -848,6 +848,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
 	struct pci_dev *client;
 	struct pci_p2pdma *p2pdma;
+	int dist;
 
 	if (!provider->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -864,6 +865,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 		type = xa_to_value(xa_load(&p2pdma->map_types,
 					   map_types_idx(client)));
 	rcu_read_unlock();
+
+	if (type == PCI_P2PDMA_MAP_UNKNOWN)
+		return calc_map_type_and_dist(provider, client, &dist, false);
+
 	return type;
 }
 
@@ -906,7 +911,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
 	default:
-		WARN_ON_ONCE(1);
 		return 0;
 	}
 }
-- 
2.30.2



* [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-27 18:46   ` Bjorn Helgaas
  2021-09-28 18:48   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
                   ` (17 subsequent siblings)
  20 siblings, 2 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation, which must determine the mapping type ahead of
actually creating the IOMMU mapping.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 24 +++++++++++++---------
 include/linux/pci-p2pdma.h | 41 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 1192c465ba6d..b656d8c801a7 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,13 +20,6 @@
 #include <linux/seq_buf.h>
 #include <linux/xarray.h>
 
-enum pci_p2pdma_map_type {
-	PCI_P2PDMA_MAP_UNKNOWN = 0,
-	PCI_P2PDMA_MAP_NOT_SUPPORTED,
-	PCI_P2PDMA_MAP_BUS_ADDR,
-	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
@@ -841,8 +834,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-						    struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ *	a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ *	PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ *	PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ *	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ *		using the CPU physical address (in dma-direct) or an IOVA
+ *		mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+					     struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
 	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..caac2d023f8f 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,40 @@
 struct block_device;
 struct scatterlist;
 
+enum pci_p2pdma_map_type {
+	/*
+	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+	 * type hasn't been calculated yet. Functions that return this enum
+	 * never return this value.
+	 */
+	PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+	/*
+	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+	 * traverse the host bridge and the host bridge is not in the
+	 * whitelist. DMA Mapping routines should return an error when
+	 * this is returned.
+	 */
+	PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+	/*
+	 * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+	 * each other directly through a PCI switch and the transaction will
+	 * not traverse the host bridge. Such a mapping should program
+	 * the DMA engine with PCI bus addresses.
+	 */
+	PCI_P2PDMA_MAP_BUS_ADDR,
+
+	/*
+	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+	 * to each other, but the transaction traverses a host bridge on the
+	 * whitelist. In this case, a normal mapping either with CPU physical
+	 * addresses (in the case of dma-direct) or IOVA addresses (in the
+	 * case of IOMMUs) should be used to program the DMA engine.
+	 */
+	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
@@ -30,6 +64,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 					 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+					     struct device *dev);
 int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -83,6 +119,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
 static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
 		struct scatterlist *sg, int nents, enum dma_data_direction dir,
 		unsigned long attrs)
-- 
2.30.2



* [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-27 18:53   ` Bjorn Helgaas
                     ` (2 more replies)
  2021-09-16 23:40 ` [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers Logan Gunthorpe
                   ` (16 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to look up the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 59 ++++++++++++++++++++++++++++++++++++++
 include/linux/pci-p2pdma.h | 21 ++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index b656d8c801a7..58c34f1f1473 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ *	loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-iommu dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+		       struct scatterlist *sg)
+{
+	if (state->pgmap != sg_page(sg)->pgmap) {
+		state->pgmap = sg_page(sg)->pgmap;
+		state->map = pci_p2pdma_map_type(state->pgmap, dev);
+		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+	}
+
+	if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+		sg->dma_address = sg_phys(sg) + state->bus_off;
+		sg_dma_len(sg) = sg->length;
+		sg_dma_mark_pci_p2pdma(sg);
+	}
+
+	return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
+ *	be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a dma address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the dma address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+				struct scatterlist *dma_sg)
+{
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+	dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+	sg_dma_len(dma_sg) = pg_sg->length;
+	sg_dma_mark_pci_p2pdma(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  *		to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index caac2d023f8f..e5a8d5bc0f51 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -13,6 +13,12 @@
 
 #include <linux/pci.h>
 
+struct pci_p2pdma_map_state {
+	struct dev_pagemap *pgmap;
+	int map;
+	u64 bus_off;
+};
+
 struct block_device;
 struct scatterlist;
 
@@ -70,6 +76,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
 void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+		       struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+				struct scatterlist *dma_sg);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -135,6 +146,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
 		unsigned long attrs)
 {
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+		       struct scatterlist *sg)
+{
+	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+					      struct scatterlist *dma_sg)
+{
+}
 static inline int pci_p2pdma_enable_store(const char *page,
 		struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
-- 
2.30.2



* [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 18:57   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 kernel/dma/mapping.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 7ee5284bff58..7315ae31cf1d 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	if (ents > 0)
 		debug_dma_map_sg(dev, sg, nents, ents, dir);
 	else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
-			      ents != -EIO))
+			      ents != -EIO && ents != -EREMOTEIO))
 		return -EIO;
 
 	return ents;
@@ -248,12 +248,14 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * Returns 0 on success or a negative error code on error. The following
  * error codes are supported with the given meaning:
  *
- *   -EINVAL - An invalid argument, unaligned access or other error
- *	       in usage. Will not succeed if retried.
- *   -ENOMEM - Insufficient resources (like memory or IOVA space) to
- *	       complete the mapping. Should succeed if retried later.
- *   -EIO    - Legacy error code with an unknown meaning. eg. this is
- *	       returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EINVAL	- An invalid argument, unaligned access or other error
+ *		  in usage. Will not succeed if retried.
+ *   -ENOMEM	- Insufficient resources (like memory or IOVA space) to
+ *		  complete the mapping. Should succeed if retried later.
+ *   -EIO	- Legacy error code with an unknown meaning, e.g. this is
+ *		  returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIO	- The DMA device cannot access P2PDMA memory specified in
+ *		  the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
 		    enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2



* [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:08   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
     root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
     normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
     the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_pci_p2pdma() and are ignored when unmapped.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 kernel/dma/direct.c | 44 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 4c6c5e0635e3..fa8317e8ff44 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include "direct.h"
 
 /*
@@ -421,29 +422,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 		arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
 	struct scatterlist *sg;
 	int i;
 
-	for_each_sg(sgl, sg, nents, i)
-		dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-			     attrs);
+	for_each_sg(sgl, sg, nents, i) {
+		if (sg_is_dma_pci_p2pdma(sg)) {
+			sg_dma_unmark_pci_p2pdma(sg);
+		} else {
+			dma_direct_unmap_page(dev, sg->dma_address,
+					      sg_dma_len(sg), dir, attrs);
+		}
+	}
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	int i;
+	struct pci_p2pdma_map_state p2pdma_state = {};
+	enum pci_p2pdma_map_type map;
 	struct scatterlist *sg;
+	int i, ret;
 
 	for_each_sg(sgl, sg, nents, i) {
+		if (is_pci_p2pdma_page(sg_page(sg))) {
+			map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+			switch (map) {
+			case PCI_P2PDMA_MAP_BUS_ADDR:
+				continue;
+			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+				/*
+				 * Mapping through host bridge should be
+				 * mapped normally, thus we do nothing
+				 * and continue below.
+				 */
+				break;
+			default:
+				ret = -EREMOTEIO;
+				goto out_unmap;
+			}
+		}
+
 		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
 				sg->offset, sg->length, dir, attrs);
-		if (sg->dma_address == DMA_MAPPING_ERROR)
+		if (sg->dma_address == DMA_MAPPING_ERROR) {
+			ret = -EIO;
 			goto out_unmap;
+		}
 		sg_dma_len(sg) = sg->length;
 	}
 
@@ -451,7 +483,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 
 out_unmap:
 	dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-	return -EIO;
+	return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:11   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
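The helper's logic can be sketched in isolation. This is a hypothetical userspace model: the struct and flag names echo the patch below, but nothing here is the kernel's actual definition.

```c
#include <stdbool.h>
#include <stddef.h>

#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)

/* Simplified stand-in for struct dma_map_ops: only the flags word. */
struct dma_map_ops_model {
	unsigned int flags;
};

/*
 * Model of dma_pci_p2pdma_supported(): a NULL ops pointer means the
 * device uses dma-direct, which supports P2PDMA; otherwise the ops
 * implementation must advertise support via the flag.
 */
static bool model_pci_p2pdma_supported(const struct dma_map_ops_model *ops)
{
	if (!ops)
		return true;
	return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
}
```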

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/dma-map-ops.h | 10 ++++++++++
 include/linux/dma-mapping.h |  5 +++++
 kernel/dma/mapping.c        | 18 ++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..b60d6870c847 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED     (1 << 0)
+
 struct dma_map_ops {
+	unsigned int flags;
+
 	void *(*alloc)(struct device *dev, size_t size,
 			dma_addr_t *dma_handle, gfp_t gfp,
 			unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
 		unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
 {
 	return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+	return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
 	return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 7315ae31cf1d..23a02fe1832a 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -727,6 +727,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+
+	/* if ops is not set, dma direct will be used which supports P2PDMA */
+	if (!ops)
+		return true;
+
+	/*
+	 * Note: dma_ops_bypass is not checked here because P2PDMA should
+	 * not be used with dma mapping ops that do not have support even
+	 * if the specific device is bypassing them.
+	 */
+
+	return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2



* [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:15   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist consists only of P2PDMA pages.
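The two-pass zero-length scheme can be modeled in userspace. This is a toy sketch (hypothetical struct, greatly simplified from the kernel's scatterlist): a zero length keeps a segment out of the IOVA allocation, and a later "finalise" pass gives those segments a bus address instead.

```c
#include <stddef.h>

/* Toy segment: a zero length marks a P2PDMA bus-address segment. */
struct seg_model {
	size_t len;
	int is_bus_addr;
};

/* Pass 1: only nonzero lengths contribute to the IOVA allocation size. */
static size_t model_iova_len(const struct seg_model *segs, int n)
{
	size_t total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += segs[i].len;
	return total;
}

/* Pass 2 ("finalise"): zero-length segments get a bus address instead. */
static void model_finalise(struct seg_model *segs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (segs[i].len == 0)
			segs[i].is_bus_addr = 1;
}
```

If every segment is a bus-address segment, pass 1 returns zero and no IOVA is allocated at all, which is exactly the `if (!iova_len)` early return in the patch below.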

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
     the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
     normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
     the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 896bea04c347..e7c658d04222 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/swiotlb.h>
 #include <linux/scatterlist.h>
 #include <linux/vmalloc.h>
@@ -911,6 +912,16 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		sg_dma_address(s) = DMA_MAPPING_ERROR;
 		sg_dma_len(s) = 0;
 
+		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+			if (i > 0)
+				cur = sg_next(cur);
+
+			pci_p2pdma_map_bus_segment(s, cur);
+			count++;
+			cur_len = 0;
+			continue;
+		}
+
 		/*
 		 * Now fill in the real DMA data. If...
 		 * - there is a valid output segment to append to
@@ -1008,6 +1019,8 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 	struct iova_domain *iovad = &cookie->iovad;
 	struct scatterlist *s, *prev = NULL;
 	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+	struct dev_pagemap *pgmap = NULL;
+	enum pci_p2pdma_map_type map_type;
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
@@ -1042,6 +1055,35 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		s_length = iova_align(iovad, s_length + s_iova_off);
 		s->length = s_length;
 
+		if (is_pci_p2pdma_page(sg_page(s))) {
+			if (sg_page(s)->pgmap != pgmap) {
+				pgmap = sg_page(s)->pgmap;
+				map_type = pci_p2pdma_map_type(pgmap, dev);
+			}
+
+			switch (map_type) {
+			case PCI_P2PDMA_MAP_BUS_ADDR:
+				/*
+				 * A zero length will be ignored by
+				 * iommu_map_sg() and then can be detected
+				 * in __finalise_sg() to actually map the
+				 * bus address.
+				 */
+				s->length = 0;
+				continue;
+			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+				/*
+				 * Mapping through host bridge should be
+				 * mapped with regular IOVAs, thus we
+				 * do nothing here and continue below.
+				 */
+				break;
+			default:
+				ret = -EREMOTEIO;
+				goto out_restore_sg;
+			}
+		}
+
 		/*
 		 * Due to the alignment of our single IOVA allocation, we can
 		 * depend on these assumptions about the segment boundary mask:
@@ -1064,6 +1106,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 		prev = s;
 	}
 
+	if (!iova_len)
+		return __finalise_sg(dev, sg, nents, 0);
+
 	iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
 	if (!iova) {
 		ret = -ENOMEM;
@@ -1085,7 +1130,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 out_restore_sg:
 	__invalidate_sg(sg, nents);
 out:
-	if (ret != -ENOMEM)
+	if (ret != -ENOMEM && ret != -EREMOTEIO)
 		return -EINVAL;
 	return ret;
 }
@@ -1093,7 +1138,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 		int nents, enum dma_data_direction dir, unsigned long attrs)
 {
-	dma_addr_t start, end;
+	dma_addr_t end, start = DMA_MAPPING_ERROR;
 	struct scatterlist *tmp;
 	int i;
 
@@ -1109,14 +1154,22 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
 	 * The scatterlist segments are mapped into a single
 	 * contiguous IOVA allocation, so this is incredibly easy.
 	 */
-	start = sg_dma_address(sg);
-	for_each_sg(sg_next(sg), tmp, nents - 1, i) {
+	for_each_sg(sg, tmp, nents, i) {
+		if (sg_is_dma_pci_p2pdma(tmp)) {
+			sg_dma_unmark_pci_p2pdma(tmp);
+			continue;
+		}
 		if (sg_dma_len(tmp) == 0)
 			break;
-		sg = tmp;
+
+		if (start == DMA_MAPPING_ERROR)
+			start = sg_dma_address(tmp);
+
+		end = sg_dma_address(tmp) + sg_dma_len(tmp);
 	}
-	end = sg_dma_address(sg) + sg_dma_len(sg);
-	__iommu_dma_unmap(dev, start, end - start);
+
+	if (start != DMA_MAPPING_ERROR)
+		__iommu_dma_unmap(dev, start, end - start);
 }
 
 static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
@@ -1309,6 +1362,7 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 }
 
 static const struct dma_map_ops iommu_dma_ops = {
+	.flags			= DMA_F_PCI_P2PDMA_SUPPORTED,
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
 	.alloc_pages		= dma_common_alloc_pages,
-- 
2.30.2



* [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-30  5:06   ` Chaitanya Kulkarni
  2021-09-16 23:40 ` [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable() Logan Gunthorpe
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +++++++++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7efb31b87f37..916750a54f60 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
 		blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+	if (ctrl->ops->supports_pci_p2pdma &&
+	    ctrl->ops->supports_pci_p2pdma(ctrl))
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
 	ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9871c0c9374c..fb9bfc52a6d7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
 	unsigned int flags;
 #define NVME_F_FABRICS			(1 << 0)
 #define NVME_F_METADATA_SUPPORTED	(1 << 1)
-#define NVME_F_PCI_P2PDMA		(1 << 2)
 	int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 	int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
 	void (*submit_async_event)(struct nvme_ctrl *ctrl);
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b82492cd7503..7d1ef66eac2e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2874,17 +2874,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
 	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+	struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+	return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
-	.flags			= NVME_F_METADATA_SUPPORTED |
-				  NVME_F_PCI_P2PDMA,
+	.flags			= NVME_F_METADATA_SUPPORTED,
 	.reg_read32		= nvme_pci_reg_read32,
 	.reg_write32		= nvme_pci_reg_write32,
 	.reg_read64		= nvme_pci_reg_read64,
 	.free_ctrl		= nvme_pci_free_ctrl,
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
+	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



* [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-10-05 22:29   ` Max Gurtovoy
  2021-09-16 23:40 ` [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported() Logan Gunthorpe
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating -EREMOTEIO errors when an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.
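The translation of the mapping return code into a block-layer status can be sketched as follows. The status values here are illustrative stand-ins, not the kernel's actual blk_status_t encoding.

```c
#include <errno.h>

/* Illustrative status values; the kernel's blk_status_t encoding differs. */
enum blk_status_model {
	MODEL_BLK_STS_OK,
	MODEL_BLK_STS_RESOURCE,	/* transient: the block layer may retry */
	MODEL_BLK_STS_TARGET,	/* target can't handle this I/O: no retry */
};

/*
 * Sketch of the translation nvme_map_data() performs on the
 * dma_map_sgtable() return code: -EREMOTEIO (unsupported P2PDMA
 * transfer) becomes a non-retryable target error; any other failure
 * stays a retryable resource error.
 */
static enum blk_status_model map_rc_to_status(int rc)
{
	if (rc == 0)
		return MODEL_BLK_STS_OK;
	if (rc == -EREMOTEIO)
		return MODEL_BLK_STS_TARGET;
	return MODEL_BLK_STS_RESOURCE;
}
```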

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/pci.c | 69 +++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7d1ef66eac2e..e2cd73129a88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -228,11 +228,10 @@ struct nvme_iod {
 	bool use_sgl;
 	int aborted;
 	int npages;		/* In the PRP list. 0 means small pool in use */
-	int nents;		/* Used in scatterlist */
 	dma_addr_t first_dma;
 	unsigned int dma_len;	/* length of single DMA segment mapping */
 	dma_addr_t meta_dma;
-	struct scatterlist *sg;
+	struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -523,7 +522,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+	return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -575,17 +574,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
 	}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-				    rq_dma_dir(req));
-	else
-		dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -596,9 +584,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->nents);
+	WARN_ON_ONCE(!iod->sgt.nents);
+
+	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-	nvme_unmap_sg(dev, req);
 	if (iod->npages == 0)
 		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
 			      iod->first_dma);
@@ -606,7 +595,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 		nvme_free_sgls(dev, req);
 	else
 		nvme_free_prps(dev, req);
-	mempool_free(iod->sg, dev->iod_mempool);
+	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -629,7 +618,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
-	struct scatterlist *sg = iod->sg;
+	struct scatterlist *sg = iod->sgt.sgl;
 	int dma_len = sg_dma_len(sg);
 	u64 dma_addr = sg_dma_address(sg);
 	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -702,16 +691,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		dma_len = sg_dma_len(sg);
 	}
 done:
-	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
 	nvme_free_prps(dev, req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
-	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
 			"Invalid SGL for payload:%d nents:%d\n",
-			blk_rq_payload_bytes(req), iod->nents);
+			blk_rq_payload_bytes(req), iod->sgt.nents);
 	return BLK_STS_IOERR;
 }
 
@@ -737,12 +726,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmd, int entries)
+		struct request *req, struct nvme_rw_command *cmd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sg = iod->sg;
+	struct scatterlist *sg = iod->sgt.sgl;
+	int entries = iod->sgt.nents;
 	dma_addr_t sgl_dma;
 	int i = 0;
 
@@ -840,7 +830,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
-	int nr_mapped;
+	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
 		struct bio_vec bv = req_bvec(req);
@@ -858,26 +848,25 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	}
 
 	iod->dma_len = 0;
-	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
-	if (!iod->sg)
+	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
+	if (!iod->sgt.sgl)
 		return BLK_STS_RESOURCE;
-	sg_init_table(iod->sg, blk_rq_nr_phys_segments(req));
-	iod->nents = blk_rq_map_sg(req->q, req, iod->sg);
-	if (!iod->nents)
+	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
+	iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
+	if (!iod->sgt.orig_nents)
 		goto out_free_sg;
 
-	if (is_pci_p2pdma_page(sg_page(iod->sg)))
-		nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
-				iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	else
-		nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
-					     rq_dma_dir(req), DMA_ATTR_NO_WARN);
-	if (!nr_mapped)
+	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
+			     DMA_ATTR_NO_WARN);
+	if (rc) {
+		if (rc == -EREMOTEIO)
+			ret = BLK_STS_TARGET;
 		goto out_free_sg;
+	}
 
 	iod->use_sgl = nvme_pci_use_sgls(dev, req);
 	if (iod->use_sgl)
-		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw, nr_mapped);
+		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
 	if (ret != BLK_STS_OK)
@@ -885,9 +874,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return BLK_STS_OK;
 
 out_unmap_sg:
-	nvme_unmap_sg(dev, req);
+	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 out_free_sg:
-	mempool_free(iod->sg, dev->iod_mempool);
+	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 	return ret;
 }
 
@@ -920,7 +909,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	iod->aborted = 0;
 	iod->npages = -1;
-	iod->nents = 0;
+	iod->sgt.nents = 0;
 
 	/*
 	 * We should not need to do this, but we're still using this to
-- 
2.30.2



* [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:17   ` Jason Gunthorpe
  2021-10-05 22:31   ` Max Gurtovoy
  2021-09-16 23:40 ` [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable() Logan Gunthorpe
                   ` (9 subsequent siblings)
  20 siblings, 2 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h    | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 891174ccd44b..9ea212c187f2 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -414,7 +414,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device *ndev,
 	if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
 		goto out_free_rsp;
 
-	if (!ib_uses_virt_dma(ndev->device))
+	if (ib_dma_pci_p2p_dma_supported(ndev->device))
 		r->req.p2p_client = &ndev->device->dev;
 	r->send_sge.length = sizeof(*r->req.cqe);
 	r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4b50d9a3018a..2b71c9ca2186 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3986,6 +3986,17 @@ static inline bool ib_uses_virt_dma(struct ib_device *dev)
 	return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+	if (ib_uses_virt_dma(dev))
+		return false;
+
+	return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



* [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (10 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:43   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped.

Switch to the dma_map_sgtable() interface which will allow for better
error reporting if the P2PDMA pages are unsupported.

The change to sgtable also appears to fix a couple of subtle
error-path bugs:

  - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
    pointer that may have been advanced from the original call, as
    well as an nents value that was not the original number of nents
    passed when mapping.
  - Similarly, in rdma_rw_ctx_signature_init(), both sg and prot_sg
    were unmapped with the incorrect number of nents.
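Why the sg_table interface sidesteps these bugs can be shown with a toy model (hypothetical struct, simplified from the kernel's struct sg_table): the table records the original list head and entry count at map time, so unmap never depends on local variables that were advanced while building work requests.

```c
/* Toy scatterlist entry and table; only what the example needs. */
struct sg_model { int id; };

struct sg_table_model {
	struct sg_model *sgl;	 /* head of the list, fixed at map time */
	unsigned int nents;	 /* entries after DMA mapping */
	unsigned int orig_nents; /* entries the caller passed in */
};

/* "Unmap" via the table: always the original head and original count. */
static unsigned int model_unmap(const struct sg_table_model *sgt,
				struct sg_model **head)
{
	*head = sgt->sgl;
	return sgt->orig_nents;
}
```

In the old code, the equivalent of `head` and the count were whatever the local `sg` and `sg_cnt` happened to hold at the error label, which could differ from what was mapped.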

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/infiniband/core/rw.c | 75 +++++++++++++++---------------------
 include/rdma/ib_verbs.h      | 19 +++++++++
 2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5221cce65675..1bdb56380764 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,26 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-			     u32 sg_cnt, enum dma_data_direction dir)
-{
-	if (is_pci_p2pdma_page(sg_page(sg)))
-		pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-	else
-		ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sg(struct ib_device *dev, struct scatterlist *sg,
-			  u32 sg_cnt, enum dma_data_direction dir)
-{
-	if (is_pci_p2pdma_page(sg_page(sg))) {
-		if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-			return 0;
-		return pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
-	}
-	return ib_dma_map_sg(dev, sg, sg_cnt, dir);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:	context to initialize
@@ -313,12 +293,16 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 		u64 remote_addr, u32 rkey, enum dma_data_direction dir)
 {
 	struct ib_device *dev = qp->pd->device;
+	struct sg_table sgt = {
+		.sgl = sg,
+		.orig_nents = sg_cnt,
+	};
 	int ret;
 
-	ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-	if (!ret)
-		return -ENOMEM;
-	sg_cnt = ret;
+	ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+	if (ret)
+		return ret;
+	sg_cnt = sgt.nents;
 
 	/*
 	 * Skip to the S/G entry that sg_offset falls into:
@@ -354,7 +338,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
 	return ret;
 
 out_unmap_sg:
-	rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+	ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
 	return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -387,6 +371,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 						    qp->integrity_en);
 	struct ib_rdma_wr *rdma_wr;
 	int count = 0, ret;
+	struct sg_table sgt = {
+		.sgl = sg,
+		.orig_nents = sg_cnt,
+	};
+	struct sg_table prot_sgt = {
+		.sgl = prot_sg,
+		.orig_nents = prot_sg_cnt,
+	};
 
 	if (sg_cnt > pages_per_mr || prot_sg_cnt > pages_per_mr) {
 		pr_err("SG count too large: sg_cnt=%u, prot_sg_cnt=%u, pages_per_mr=%u\n",
@@ -394,18 +386,14 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		return -EINVAL;
 	}
 
-	ret = rdma_rw_map_sg(dev, sg, sg_cnt, dir);
-	if (!ret)
-		return -ENOMEM;
-	sg_cnt = ret;
+	ret = ib_dma_map_sgtable(dev, &sgt, dir, 0);
+	if (ret)
+		return ret;
 
 	if (prot_sg_cnt) {
-		ret = rdma_rw_map_sg(dev, prot_sg, prot_sg_cnt, dir);
-		if (!ret) {
-			ret = -ENOMEM;
+		ret = ib_dma_map_sgtable(dev, &prot_sgt, dir, 0);
+		if (ret)
 			goto out_unmap_sg;
-		}
-		prot_sg_cnt = ret;
 	}
 
 	ctx->type = RDMA_RW_SIG_MR;
@@ -426,10 +414,11 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 
 	memcpy(ctx->reg->mr->sig_attrs, sig_attrs, sizeof(struct ib_sig_attrs));
 
-	ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sg_cnt, NULL, prot_sg,
-			      prot_sg_cnt, NULL, SZ_4K);
+	ret = ib_map_mr_sg_pi(ctx->reg->mr, sg, sgt.nents, NULL, prot_sg,
+			      prot_sgt.nents, NULL, SZ_4K);
 	if (unlikely(ret)) {
-		pr_err("failed to map PI sg (%u)\n", sg_cnt + prot_sg_cnt);
+		pr_err("failed to map PI sg (%u)\n",
+		       sgt.nents + prot_sgt.nents);
 		goto out_destroy_sig_mr;
 	}
 
@@ -468,10 +457,10 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 out_free_ctx:
 	kfree(ctx->reg);
 out_unmap_prot_sg:
-	if (prot_sg_cnt)
-		rdma_rw_unmap_sg(dev, prot_sg, prot_sg_cnt, dir);
+	if (prot_sgt.nents)
+		ib_dma_unmap_sgtable(dev, &prot_sgt, dir, 0);
 out_unmap_sg:
-	rdma_rw_unmap_sg(dev, sg, sg_cnt, dir);
+	ib_dma_unmap_sgtable(dev, &sgt, dir, 0);
 	return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -604,7 +593,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 		break;
 	}
 
-	rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -632,8 +621,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
 	kfree(ctx->reg);
 
 	if (prot_sg_cnt)
-		rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-	rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+		ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+	ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 2b71c9ca2186..d04f07ab4d1a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4096,6 +4096,25 @@ static inline void ib_dma_unmap_sg_attrs(struct ib_device *dev,
 				   dma_attrs);
 }
 
+static inline int ib_dma_map_sgtable(struct ib_device *dev,
+				     struct sg_table *sgt,
+				     enum dma_data_direction direction,
+				     unsigned long dma_attrs)
+{
+	if (ib_uses_virt_dma(dev))
+		return ib_dma_virt_map_sg(dev, sgt->sgl, sgt->orig_nents);
+	return dma_map_sgtable(dev->dma_device, sgt, direction, dma_attrs);
+}
+
+static inline void ib_dma_unmap_sgtable(struct ib_device *dev,
+					struct sg_table *sgt,
+					enum dma_data_direction direction,
+					unsigned long dma_attrs)
+{
+	if (!ib_uses_virt_dma(dev))
+		dma_unmap_sgtable(dev->dma_device, sgt, direction, dma_attrs);
+}
+
 /**
  * ib_dma_map_sgtable_attrs - Map a scatter/gather table to DMA addresses
  * @dev: The device for which the DMA addresses are to be created
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-27 18:50   ` Bjorn Helgaas
                     ` (2 more replies)
  2021-09-16 23:40 ` [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
                   ` (7 subsequent siblings)
  20 siblings, 3 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

This interface has been superseded by dma_map_sg(), which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 65 --------------------------------------
 include/linux/pci-p2pdma.h | 27 ----------------
 2 files changed, 92 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 58c34f1f1473..4478633346bd 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -878,71 +878,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-		struct device *dev, struct scatterlist *sg, int nents)
-{
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nents, i) {
-		s->dma_address = sg_phys(s) - p2p_pgmap->bus_offset;
-		sg_dma_len(s) = s->length;
-	}
-
-	return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-	struct pci_p2pdma_pagemap *p2p_pgmap =
-		to_p2p_pgmap(sg_page(sg)->pgmap);
-
-	switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-		return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-	case PCI_P2PDMA_MAP_BUS_ADDR:
-		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-	default:
-		return 0;
-	}
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- *	mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-	enum pci_p2pdma_map_type map_type;
-
-	map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-	if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-		dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index e5a8d5bc0f51..0c33a40a86e7 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -72,10 +72,6 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 					     struct device *dev);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-		int nents, enum dma_data_direction dir, unsigned long attrs);
 enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
 		       struct scatterlist *sg);
@@ -135,17 +131,6 @@ pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
 {
 	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-		struct scatterlist *sg, int nents, enum dma_data_direction dir,
-		unsigned long attrs)
-{
-	return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-		struct scatterlist *sg, int nents, enum dma_data_direction dir,
-		unsigned long attrs)
-{
-}
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
 		       struct scatterlist *sg)
@@ -181,16 +166,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
 	return pci_p2pmem_find_many(&client, 1);
 }
 
-static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
-				    int nents, enum dma_data_direction dir)
-{
-	return pci_p2pdma_map_sg_attrs(dev, sg, nents, dir, 0);
-}
-
-static inline void pci_p2pdma_unmap_sg(struct device *dev,
-		struct scatterlist *sg, int nents, enum dma_data_direction dir)
-{
-	pci_p2pdma_unmap_sg_attrs(dev, sg, nents, dir, 0);
-}
-
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (12 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-28 19:47   ` Jason Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 15/20] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining them. If a caller does not set this flag and tries to
map P2PDMA pages, the mapping will fail.

This is implemented by adding a flag and a check to get_dev_pagemap().
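
The gate added to get_dev_pagemap() can be sketched as a small userspace
model; the ERR_PTR()/IS_ERR() helpers and struct fields below are
illustrative stand-ins for the kernel's, not the real API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
#define MAX_ERRNO	4095
#define EREMOTEIO	121

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

enum pgmap_type { MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_PCI_P2PDMA };

struct dev_pagemap { enum pgmap_type type; };

/* Models the new check at the end of get_dev_pagemap(): a P2PDMA pgmap
 * is only handed back when the caller opted in via allow_pci_p2pdma,
 * otherwise the lookup fails with -EREMOTEIO. */
static struct dev_pagemap *gate_pgmap(struct dev_pagemap *pgmap,
				      int allow_pci_p2pdma)
{
	if (!pgmap)
		return NULL;
	if (!allow_pci_p2pdma && pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
		return ERR_PTR(-EREMOTEIO);
	return pgmap;
}
```

This is also why the callers in the patch switch from NULL checks to
IS_ERR_OR_NULL(): the lookup can now return an error pointer as well as
NULL.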

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/dax/super.c      |  7 ++++---
 include/linux/memremap.h |  4 ++--
 include/linux/mm.h       |  1 +
 mm/gup.c                 | 28 +++++++++++++++++-----------
 mm/huge_memory.c         |  8 ++++----
 mm/memory-failure.c      |  4 ++--
 mm/memory_hotplug.c      |  2 +-
 mm/memremap.c            | 14 ++++++++++----
 8 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index fc89e91beea7..ffb6e57e65bb 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -180,9 +180,10 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
 	} else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
 		struct dev_pagemap *pgmap, *end_pgmap;
 
-		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
-		end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
-		if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
+		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL, false);
+		end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL, false);
+		if (!IS_ERR_OR_NULL(pgmap) && pgmap == end_pgmap
+				&& pgmap->type == MEMORY_DEVICE_FS_DAX
 				&& pfn_t_to_page(pfn)->pgmap == pgmap
 				&& pfn_t_to_page(end_pfn)->pgmap == pgmap
 				&& pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..f10c332dac8b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -136,7 +136,7 @@ void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma);
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -161,7 +161,7 @@ static inline void devm_memunmap_pages(struct device *dev,
 }
 
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
 {
 	return NULL;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..6afdc09d0712 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2864,6 +2864,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 886d6148d3d0..1a03b9200cd9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -522,11 +522,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 * case since they are only valid while holding the pgmap
 		 * reference.
 		 */
-		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-		if (*pgmap)
+		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap,
+					 flags & FOLL_PCI_P2PDMA);
+		if (IS_ERR(*pgmap)) {
+			page = ERR_CAST(*pgmap);
+			goto out;
+		} else if (*pgmap) {
 			page = pte_page(pte);
-		else
+		} else {
 			goto no_page;
+		}
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
@@ -846,7 +851,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		return NULL;
 
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
-	if (ctx.pgmap)
+	if (!IS_ERR_OR_NULL(ctx.pgmap))
 		put_dev_pagemap(ctx.pgmap);
 	return page;
 }
@@ -1199,7 +1204,7 @@ static long __get_user_pages(struct mm_struct *mm,
 		nr_pages -= page_increm;
 	} while (nr_pages);
 out:
-	if (ctx.pgmap)
+	if (!IS_ERR_OR_NULL(ctx.pgmap))
 		put_dev_pagemap(ctx.pgmap);
 	return i ? i : ret;
 }
@@ -2149,8 +2154,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			if (unlikely(flags & FOLL_LONGTERM))
 				goto pte_unmap;
 
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
+			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap,
+						flags & FOLL_PCI_P2PDMA);
+			if (IS_ERR_OR_NULL(pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
@@ -2198,7 +2204,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;
 
 pte_unmap:
-	if (pgmap)
+	if (!IS_ERR_OR_NULL(pgmap))
 		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
 	return ret;
@@ -2233,8 +2239,8 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 	do {
 		struct page *page = pfn_to_page(pfn);
 
-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
+		pgmap = get_dev_pagemap(pfn, pgmap, flags & FOLL_PCI_P2PDMA);
+		if (IS_ERR_OR_NULL(pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			ret = 0;
 			break;
@@ -2708,7 +2714,7 @@ static int internal_get_user_pages_fast(unsigned long start,
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-				       FOLL_FAST_ONLY)))
+				       FOLL_FAST_ONLY | FOLL_PCI_P2PDMA)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..853157a84b00 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1014,8 +1014,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
+	*pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+	if (IS_ERR_OR_NULL(*pgmap))
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
@@ -1181,8 +1181,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
+	*pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+	if (IS_ERR_OR_NULL(*pgmap))
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..8f15ccce5aea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1635,8 +1635,8 @@ int memory_failure(unsigned long pfn, int flags)
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		if (pfn_valid(pfn)) {
-			pgmap = get_dev_pagemap(pfn, NULL);
-			if (pgmap)
+			pgmap = get_dev_pagemap(pfn, NULL, false);
+			if (!IS_ERR_OR_NULL(pgmap))
 				return memory_failure_dev_pagemap(pfn, flags,
 								  pgmap);
 		}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fd0be32a281..fa5cf8898b6b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -285,7 +285,7 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	 * the section may be 'offline' but 'valid'. Only
 	 * get_dev_pagemap() can determine sub-section online status.
 	 */
-	pgmap = get_dev_pagemap(pfn, NULL);
+	pgmap = get_dev_pagemap(pfn, NULL, true);
 	put_dev_pagemap(pgmap);
 
 	/* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
diff --git a/mm/memremap.c b/mm/memremap.c
index ed593bf87109..ceebdb8a72bb 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -206,14 +206,14 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 				"altmap not supported for multiple ranges\n"))
 		return -EINVAL;
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL, true);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
 		return -ENOMEM;
 	}
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL, true);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -465,19 +465,20 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
  * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
  * @pgmap: optional known pgmap that already has a reference
+ * @allow_pci_p2pdma: allow getting a pgmap with the PCI P2PDMA type
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
  */
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+		struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
 {
 	resource_size_t phys = PFN_PHYS(pfn);
 
 	/*
 	 * In the cached case we're already holding a live reference.
 	 */
-	if (pgmap) {
+	if (!IS_ERR_OR_NULL(pgmap)) {
 		if (phys >= pgmap->range.start && phys <= pgmap->range.end)
 			return pgmap;
 		put_dev_pagemap(pgmap);
@@ -490,6 +491,11 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		pgmap = NULL;
 	rcu_read_unlock();
 
+	if (!allow_pci_p2pdma && pgmap->type == MEMORY_DEVICE_PCI_P2PDMA) {
+		put_dev_pagemap(pgmap);
+		return ERR_PTR(-EREMOTEIO);
+	}
+
 	return pgmap;
 }
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 15/20] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (13 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 16/20] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
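
The flag handling in the new helpers amounts to OR'ing the caller's
gup_flags with FOLL_WRITE for reads (GUP writes *into* the pages on a
read). A rough model, with the flag values copied from this series'
mm.h (FOLL_WRITE's value is the kernel's):

```c
#include <assert.h>

#define FOLL_WRITE	0x01
#define FOLL_PCI_P2PDMA	0x100000

/* Models the hunk in iov_iter_get_pages_flags(): caller-supplied
 * gup_flags pass through unchanged, and FOLL_WRITE is added when the
 * iterator direction is not WRITE. */
static unsigned int effective_gup_flags(unsigned int gup_flags,
					int iter_is_write)
{
	if (!iter_is_write)
		gup_flags |= FOLL_WRITE;
	return gup_flags;
}
```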

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/uio.h | 21 +++++++++++++++++----
 lib/iov_iter.c      | 28 ++++++++++++++++------------
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..d4ce252db728 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -228,10 +228,23 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
-			size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
-			size_t maxsize, size_t *start);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *start,
+		unsigned int gup_flags);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start,
+		unsigned int gup_flags);
+static inline ssize_t iov_iter_get_pages(struct iov_iter *i,
+		struct page **pages, size_t maxsize, unsigned maxpages,
+		size_t *start)
+{
+	return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
+static inline ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start)
+{
+	return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..bbf0fb6736a9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1468,9 +1468,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
+		   size_t *start, unsigned int gup_flags)
 {
 	size_t len;
 	int n, res;
@@ -1485,9 +1485,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0,
-				pages);
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+
+		res = get_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res < 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1507,7 +1509,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
 }
-EXPORT_SYMBOL(iov_iter_get_pages);
+EXPORT_SYMBOL(iov_iter_get_pages_flags);
 
 static struct page **get_pages_array(size_t n)
 {
@@ -1589,9 +1591,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
 	return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+		   size_t *start, unsigned int gup_flags)
 {
 	struct page **p;
 	size_t len;
@@ -1604,14 +1606,16 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 
 	if (likely(iter_is_iovec(i))) {
 		unsigned long addr;
-
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n,
-				iov_iter_rw(i) != WRITE ?  FOLL_WRITE : 0, p);
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+
+		res = get_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res < 0)) {
 			kvfree(p);
 			return res;
@@ -1637,7 +1641,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
 	return -EFAULT;
 }
-EXPORT_SYMBOL(iov_iter_get_pages_alloc);
+EXPORT_SYMBOL(iov_iter_get_pages_alloc_flags);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 16/20] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (14 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 15/20] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 17/20] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap-based filesystems
as well as direct I/O to block devices.
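
The gating logic in __bio_iov_iter_get_pages() can be modelled in
userspace as below; the structs are illustrative stand-ins for the
block-layer objects the hunk dereferences, not the kernel definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Flag value copied from this series' mm.h addition. */
#define FOLL_PCI_P2PDMA	0x100000

/* Minimal stand-ins for struct bio -> bdev -> disk -> queue. */
struct request_queue { int pci_p2pdma_capable; };
struct gendisk { struct request_queue *queue; };
struct block_device { struct gendisk *bd_disk; };
struct bio { struct block_device *bi_bdev; };

/* Models the new check: only pass FOLL_PCI_P2PDMA down to GUP when
 * the underlying request queue advertises P2PDMA support. */
static unsigned int bio_gup_flags(const struct bio *bio)
{
	unsigned int flags = 0;

	if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
	    bio->bi_bdev->bd_disk->queue->pci_p2pdma_capable)
		flags |= FOLL_PCI_P2PDMA;
	return flags;
}
```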

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 5df3dd282e40..2436e83fe3b4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1088,6 +1088,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
 	bool same_page = false;
+	unsigned int flags = 0;
 	ssize_t size, left;
 	unsigned len, i;
 	size_t offset;
@@ -1100,7 +1101,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+	    blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+		flags |= FOLL_PCI_P2PDMA;
+
+	size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+					&offset, flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 17/20] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (15 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 16/20] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 18/20] mm: use custom page_free for P2PDMA pages Logan Gunthorpe
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 block/blk-map.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 4526adde0156..7508448e290c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -234,6 +234,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		gfp_t gfp_mask)
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
+	unsigned int flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -246,13 +247,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 	bio->bi_opf |= req_op(rq);
 
+	if (blk_queue_pci_p2pdma(rq->q))
+		flags |= FOLL_PCI_P2PDMA;
+
 	while (iov_iter_count(iter)) {
 		struct page **pages;
 		ssize_t bytes;
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+						       &offs, flags);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 18/20] mm: use custom page_free for P2PDMA pages
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (16 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 17/20] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

When P2PDMA pages are passed to userspace, they will need to be
reference counted properly and returned to their genalloc pool once
their reference count drops back to 1. This is accomplished with the
existing DEV_PAGEMAP_OPS and the .page_free() operation.

Change CONFIG_P2PDMA to select CONFIG_DEV_PAGEMAP_OPS and add
MEMORY_DEVICE_PCI_P2PDMA to page_is_devmap_managed(),
devmap_managed_enable_[put|get]() and free_devmap_managed_page().
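
The lifecycle this hooks into can be sketched in userspace: a
devmap-managed page is released back to its allocator when its refcount
drops to 1 (the pgmap's own reference), at which point .page_free()
runs. The names below are illustrative models, not the kernel API:

```c
#include <assert.h>

/* Userspace model of a devmap-managed page. */
struct page_model {
	int refcount;
	int freed_to_pool;	/* set by the modeled .page_free() op */
};

/* Stands in for p2pdma_page_free(), which calls gen_pool_free(). */
static void p2pdma_page_free_model(struct page_model *page)
{
	page->freed_to_pool = 1;
}

/* Models the devmap-managed put path: the .page_free() op fires when
 * the refcount drops to 1, not 0. */
static void put_page_model(struct page_model *page)
{
	if (--page->refcount == 1)
		p2pdma_page_free_model(page);
}
```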

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/Kconfig  |  1 +
 drivers/pci/p2pdma.c | 13 +++++++++++++
 include/linux/mm.h   |  1 +
 mm/memremap.c        | 12 +++++++++---
 4 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 90b4bddb3300..b31d35259d3a 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -165,6 +165,7 @@ config PCI_P2PDMA
 	bool "PCI peer-to-peer transfer support"
 	depends on ZONE_DEVICE && 64BIT
 	select GENERIC_ALLOCATOR
+	select DEV_PAGEMAP_OPS
 	help
 	  Enables drivers to do PCI peer-to-peer transactions to and from
 	  BARs that are exposed in other devices that are the part of
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4478633346bd..2422af5a529c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -100,6 +100,18 @@ static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
+static void p2pdma_page_free(struct page *page)
+{
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+
+	gen_pool_free(pgmap->provider->p2pdma->pool,
+		      (uintptr_t)page_to_virt(page), PAGE_SIZE);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+	.page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
 	struct pci_dev *pdev = data;
@@ -197,6 +209,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->range.end = pgmap->range.start + size - 1;
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->ops = &p2pdma_pgmap_ops;
 
 	p2p_pgmap->provider = pdev;
 	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6afdc09d0712..9a6ea00e5292 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1163,6 +1163,7 @@ static inline bool page_is_devmap_managed(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
+	case MEMORY_DEVICE_PCI_P2PDMA:
 		return true;
 	default:
 		break;
diff --git a/mm/memremap.c b/mm/memremap.c
index ceebdb8a72bb..fbdc9991af0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -44,14 +44,16 @@ EXPORT_SYMBOL(devmap_managed_key);
 static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
+	    pgmap->type == MEMORY_DEVICE_FS_DAX ||
+	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
 		static_branch_dec(&devmap_managed_key);
 }
 
 static void devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-	    pgmap->type == MEMORY_DEVICE_FS_DAX)
+	    pgmap->type == MEMORY_DEVICE_FS_DAX ||
+	    pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
 		static_branch_inc(&devmap_managed_key);
 }
 #else
@@ -355,6 +357,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	case MEMORY_DEVICE_GENERIC:
 		break;
 	case MEMORY_DEVICE_PCI_P2PDMA:
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
 		params.pgprot = pgprot_noncached(params.pgprot);
 		break;
 	default:
@@ -504,7 +510,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 void free_devmap_managed_page(struct page *page)
 {
 	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
+	if (!is_device_private_page(page) && !is_pci_p2pdma_page(page)) {
 		wake_up_var(&page->_refcount);
 		return;
 	}
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (17 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 18/20] mm: use custom page_free for P2PDMA pages Logan Gunthorpe
@ 2021-09-16 23:40 ` Logan Gunthorpe
  2021-09-27 18:49   ` Bjorn Helgaas
                     ` (2 more replies)
  2021-09-16 23:41 ` [PATCH v3 20/20] nvme-pci: allow mmaping the CMB in userspace Logan Gunthorpe
  2021-09-28 20:02 ` [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Jason Gunthorpe
  20 siblings, 3 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:40 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk and their reference count
incremented. They are returned to the genalloc when the page is put.

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages), so
the backing P2P memory is tracked in a structure stored in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. Concurrent access to the flag is synchronized
with RCU.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
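
For illustration, a character-device driver consuming these helpers would
wire them up roughly as follows (a sketch only; example_get_pdev() is a
hypothetical driver-specific lookup, and patch 20 does the equivalent for
the NVMe char device):

	static int example_open(struct inode *inode, struct file *file)
	{
		/* example_get_pdev() is hypothetical: however the driver
		 * finds its struct pci_dev from the char device inode */
		struct pci_dev *pdev = example_get_pdev(inode);

		/* point f_mapping at the p2pdma inode so all VMAs can be
		 * unmapped if the device is unbound */
		pci_p2pdma_mmap_file_open(pdev, file);
		return 0;
	}

	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		struct pci_dev *pdev = example_get_pdev(file->f_inode);

		/* sets up the VMA; p2pmem is allocated on first fault */
		return pci_mmap_p2pmem(pdev, vma);
	}

	static const struct file_operations example_fops = {
		.owner	= THIS_MODULE,
		.open	= example_open,
		.mmap	= example_mmap,
	};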

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c       | 263 ++++++++++++++++++++++++++++++++++++-
 include/linux/pci-p2pdma.h |  11 ++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 273 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2422af5a529c..a5adf57af53a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -16,14 +16,19 @@
 #include <linux/genalloc.h>
 #include <linux/memremap.h>
 #include <linux/percpu-refcount.h>
+#include <linux/pfn_t.h>
+#include <linux/pseudo_fs.h>
 #include <linux/random.h>
 #include <linux/seq_buf.h>
 #include <linux/xarray.h>
+#include <uapi/linux/magic.h>
 
 struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
 	struct xarray map_types;
+	struct inode *inode;
+	bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -32,6 +37,14 @@ struct pci_p2pdma_pagemap {
 	u64 bus_offset;
 };
 
+struct pci_p2pdma_map {
+	struct kref ref;
+	struct pci_dev *pdev;
+	struct inode *inode;
+	void *kaddr;
+	size_t len;
+};
+
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
 {
 	return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
@@ -100,6 +113,26 @@ static const struct attribute_group p2pmem_group = {
 	.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+	.name = "p2dma",
+	.owner = THIS_MODULE,
+	.init_fs_context = pci_p2pdma_fs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
 static void p2pdma_page_free(struct page *page)
 {
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
@@ -128,6 +161,9 @@ static void pci_p2pdma_release(void *data)
 	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
 	xa_destroy(&p2pdma->map_types);
+
+	iput(p2pdma->inode);
+	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -145,17 +181,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	if (!p2p->pool)
 		goto out;
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+			      &pci_p2pdma_fs_cnt);
 	if (error)
 		goto out_pool_destroy;
 
+	p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+	if (IS_ERR(p2p->inode)) {
+		error = -ENOMEM;
+		goto out_unpin_fs;
+	}
+
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (error)
+		goto out_put_inode;
+
 	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
 	if (error)
-		goto out_pool_destroy;
+		goto out_put_inode;
 
 	rcu_assign_pointer(pdev->p2pdma, p2p);
 	return 0;
 
+out_put_inode:
+	iput(p2p->inode);
+out_unpin_fs:
+	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
 	gen_pool_destroy(p2p->pool);
 out:
@@ -163,6 +214,45 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	return error;
 }
 
+static void pci_p2pdma_map_free_pages(struct pci_p2pdma_map *pmap)
+{
+	int i;
+
+	if (!pmap->kaddr)
+		return;
+
+	for (i = 0; i < pmap->len; i += PAGE_SIZE)
+		put_page(virt_to_page(pmap->kaddr + i));
+
+	pmap->kaddr = NULL;
+}
+
+static void pci_p2pdma_free_mappings(struct address_space *mapping)
+{
+	struct vm_area_struct *vma;
+
+	i_mmap_lock_write(mapping);
+	if (RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+		goto out;
+
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, -1)
+		pci_p2pdma_map_free_pages(vma->vm_private_data);
+
+out:
+	i_mmap_unlock_write(mapping);
+}
+
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+	struct pci_dev *pdev = data;
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+
+	p2pdma->active = false;
+	synchronize_rcu();
+	unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
+	pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
+}
+
 /**
  * pci_p2pdma_add_resource - add memory for use as p2p memory
  * @pdev: the device to add the memory to
@@ -221,6 +311,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		goto pgmap_free;
 	}
 
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+					 pdev);
+	if (error)
+		goto pages_free;
+
 	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
 	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
@@ -229,6 +324,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (error)
 		goto pages_free;
 
+	p2pdma->active = true;
 	pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
 		 pgmap->range.start, pgmap->range.end);
 
@@ -1029,3 +1125,166 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 	return sprintf(page, "%s\n", pci_name(p2p_dev));
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+static struct pci_p2pdma_map *pci_p2pdma_map_alloc(struct pci_dev *pdev,
+						   size_t len)
+{
+	struct pci_p2pdma_map *pmap;
+
+	pmap = kzalloc(sizeof(*pmap), GFP_KERNEL);
+	if (!pmap)
+		return NULL;
+
+	kref_init(&pmap->ref);
+	pmap->pdev = pci_dev_get(pdev);
+	pmap->len = len;
+
+	return pmap;
+}
+
+static void pci_p2pdma_map_free(struct kref *ref)
+{
+	struct pci_p2pdma_map *pmap =
+		container_of(ref, struct pci_p2pdma_map, ref);
+
+	pci_p2pdma_map_free_pages(pmap);
+	pci_dev_put(pmap->pdev);
+	iput(pmap->inode);
+	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
+	kfree(pmap);
+}
+
+static void pci_p2pdma_vma_open(struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+	kref_get(&pmap->ref);
+}
+
+static void pci_p2pdma_vma_close(struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+	kref_put(&pmap->ref, pci_p2pdma_map_free);
+}
+
+static vm_fault_t pci_p2pdma_vma_fault(struct vm_fault *vmf)
+{
+	struct pci_p2pdma_map *pmap = vmf->vma->vm_private_data;
+	struct pci_p2pdma *p2pdma;
+	void *vaddr;
+	pfn_t pfn;
+	int i;
+
+	if (!pmap->kaddr) {
+		rcu_read_lock();
+		p2pdma = rcu_dereference(pmap->pdev->p2pdma);
+		if (!p2pdma)
+			goto err_out;
+
+		if (!p2pdma->active)
+			goto err_out;
+
+		pmap->kaddr = (void *)gen_pool_alloc(p2pdma->pool, pmap->len);
+		if (!pmap->kaddr)
+			goto err_out;
+
+		for (i = 0; i < pmap->len; i += PAGE_SIZE)
+			get_page(virt_to_page(pmap->kaddr + i));
+
+		rcu_read_unlock();
+	}
+
+	vaddr = pmap->kaddr + (vmf->pgoff << PAGE_SHIFT);
+	pfn = phys_to_pfn_t(virt_to_phys(vaddr), PFN_DEV | PFN_MAP);
+
+	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+
+err_out:
+	rcu_read_unlock();
+	return VM_FAULT_SIGBUS;
+}
+static const struct vm_operations_struct pci_p2pdma_vmops = {
+	.open = pci_p2pdma_vma_open,
+	.close = pci_p2pdma_vma_close,
+	.fault = pci_p2pdma_vma_fault,
+};
+
+/**
+ * pci_p2pdma_mmap_file_open - setup file mapping to store P2PMEM VMAs
+ * @pdev: the device to allocate memory from
+ * @file: the file doing the mmap
+ *
+ * Set f_mapping of the file to the p2pdma inode so that mappings
+ * can be torn down on device unbind.
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+void pci_p2pdma_mmap_file_open(struct pci_dev *pdev, struct file *file)
+{
+	struct pci_p2pdma *p2pdma;
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (p2pdma)
+		file->f_mapping = p2pdma->inode->i_mapping;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_mmap_file_open);
+
+/**
+ * pci_mmap_p2pmem - setup an mmap region to be backed with P2PDMA memory
+ *	that was registered with pci_p2pdma_add_resource()
+ * @pdev: the device to allocate memory from
+ * @vma: the userspace vma to map the memory to
+ *
+ * The file must call pci_p2pdma_mmap_file_open() in its open() operation.
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
+{
+	struct pci_p2pdma_map *pmap;
+	struct pci_p2pdma *p2pdma;
+	int ret;
+
+	/* prevent private mappings from being established */
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted private mapping\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
+	if (!pmap)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (!p2pdma) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+			    &pci_p2pdma_fs_cnt);
+	if (ret)
+		goto out;
+
+	ihold(p2pdma->inode);
+	pmap->inode = p2pdma->inode;
+	rcu_read_unlock();
+
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_private_data = pmap;
+	vma->vm_ops = &pci_p2pdma_vmops;
+
+	return 0;
+
+out:
+	rcu_read_unlock();
+	kfree(pmap);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 0c33a40a86e7..f9f19f3db676 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -81,6 +81,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+void pci_p2pdma_mmap_file_open(struct pci_dev *pdev, struct file *file);
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -152,6 +154,15 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+static inline void pci_p2pdma_mmap_file_open(struct pci_dev *pdev,
+					     struct file *file)
+{
+}
+static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
+				  struct vm_area_struct *vma)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 35687dcb1a42..af737842c56f 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -88,6 +88,7 @@
 #define BPF_FS_MAGIC		0xcafe4a11
 #define AAFS_MAGIC		0x5a3c69f0
 #define ZONEFS_MAGIC		0x5a4f4653
+#define P2PDMA_MAGIC		0x70327064
 
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 20/20] nvme-pci: allow mmaping the CMB in userspace
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (18 preceding siblings ...)
  2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
@ 2021-09-16 23:41 ` Logan Gunthorpe
  2021-09-28 20:02 ` [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Jason Gunthorpe
  20 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-16 23:41 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Logan Gunthorpe

Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory
(the offset is ignored), so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP, and inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.
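
For illustration only (not part of the patch), userspace usage might look
like the following sketch, assuming the controller's char device is at
/dev/nvme0 and "data" resides on a filesystem backed by that NVMe device
(error paths shortened; O_DIRECT requires suitably aligned lengths):

	#define _GNU_SOURCE	/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN 4096	/* one page of CMB memory */

	int main(void)
	{
		int cfd = open("/dev/nvme0", O_RDWR);
		if (cfd < 0) { perror("open ctrl"); return 1; }

		/* the offset is ignored; the kernel chooses the CMB address */
		void *cmb = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				 MAP_SHARED, cfd, 0);
		if (cmb == MAP_FAILED) { perror("mmap"); return 1; }

		memset(cmb, 0xaa, LEN);	/* fill the CMB buffer */

		/* O_DIRECT I/O from the CMB buffer triggers P2PDMA */
		int dfd = open("data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		if (dfd < 0) { perror("open data"); return 1; }
		if (write(dfd, cmb, LEN) != LEN)
			perror("write");

		close(dfd);
		munmap(cmb, LEN);
		close(cfd);
		return 0;
	}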

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c | 15 +++++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/pci.c  | 18 ++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 916750a54f60..dfc18f0bdeee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3075,6 +3075,10 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
 	}
 
 	file->private_data = ctrl;
+
+	if (ctrl->ops->mmap_file_open)
+		ctrl->ops->mmap_file_open(ctrl, file);
+
 	return 0;
 }
 
@@ -3088,12 +3092,23 @@ static int nvme_dev_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct nvme_ctrl *ctrl = file->private_data;
+
+	if (!ctrl->ops->mmap_cmb)
+		return -ENODEV;
+
+	return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
 	.owner		= THIS_MODULE,
 	.open		= nvme_dev_open,
 	.release	= nvme_dev_release,
 	.unlocked_ioctl	= nvme_dev_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
+	.mmap		= nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fb9bfc52a6d7..1cc721290d4c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -485,6 +485,8 @@ struct nvme_ctrl_ops {
 	void (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+	void (*mmap_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+	int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e2cd73129a88..9d69e4a3d62e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2870,6 +2870,22 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
 	return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static void nvme_pci_mmap_file_open(struct nvme_ctrl *ctrl,
+				    struct file *file)
+{
+	struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+	pci_p2pdma_mmap_file_open(pdev, file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+			     struct vm_area_struct *vma)
+{
+	struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+	return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name			= "pcie",
 	.module			= THIS_MODULE,
@@ -2881,6 +2897,8 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.submit_async_event	= nvme_pci_submit_async_event,
 	.get_address		= nvme_pci_get_address,
 	.supports_pci_p2pdma	= nvme_pci_supports_pci_p2pdma,
+	.mmap_file_open		= nvme_pci_mmap_file_open,
+	.mmap_cmb		= nvme_pci_mmap_cmb,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static
  2021-09-16 23:40 ` [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static Logan Gunthorpe
@ 2021-09-27 18:46   ` Bjorn Helgaas
  2021-09-28 18:48   ` Jason Gunthorpe
  1 sibling, 0 replies; 87+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 18:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:43PM -0600, Logan Gunthorpe wrote:
> pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
> implementation because it will need to determine the mapping type
> ahead of actually doing the mapping to create the actual iommu mapping.

I don't expect this to go via the PCI tree, but if it did I would
silently:

  s/PCI/P2PDMA: make pci_p2pdma_map_type() non-static/
    PCI/P2PDMA: Expose pci_p2pdma_map_type()/
  s/iommu/IOMMU/

and mention what this patch does in the commit log (in addition to the
subject) and fix a couple minor typos below.

> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/p2pdma.c       | 24 +++++++++++++---------
>  include/linux/pci-p2pdma.h | 41 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 1192c465ba6d..b656d8c801a7 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -20,13 +20,6 @@
>  #include <linux/seq_buf.h>
>  #include <linux/xarray.h>
>  
> -enum pci_p2pdma_map_type {
> -	PCI_P2PDMA_MAP_UNKNOWN = 0,
> -	PCI_P2PDMA_MAP_NOT_SUPPORTED,
> -	PCI_P2PDMA_MAP_BUS_ADDR,
> -	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> -};
> -
>  struct pci_p2pdma {
>  	struct gen_pool *pool;
>  	bool p2pmem_published;
> @@ -841,8 +834,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
>  
> -static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> -						    struct device *dev)
> +/**
> + * pci_p2pdma_map_type - return the type of mapping that should be used for
> + *	a given device and pgmap
> + * @pgmap: the pagemap of a page to determine the mapping type for
> + * @dev: device that is mapping the page
> + *
> + * Returns one of:
> + *	PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
> + *	PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
> + *	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
> + *		using the CPU physical address (in dma-direct) or an IOVA
> + *		mapping for the IOMMU.
> + */
> +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> +					     struct device *dev)
>  {
>  	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
>  	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 8318a97c9c61..caac2d023f8f 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -16,6 +16,40 @@
>  struct block_device;
>  struct scatterlist;
>  
> +enum pci_p2pdma_map_type {
> +	/*
> +	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
> +	 * type hasn't been calculated yet. Functions that return this enum
> +	 * never return this value.
> +	 */
> +	PCI_P2PDMA_MAP_UNKNOWN = 0,
> +
> +	/*
> +	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
> +	 * traverse the host bridge and the host bridge is not in the
> +	 * whitelist. DMA Mapping routines should return an error when
> +	 * this is returned.
> +	 */
> +	PCI_P2PDMA_MAP_NOT_SUPPORTED,
> +
> +	/*
> +	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
> +	 * eachother directly through a PCI switch and the transaction will
> +	 * not traverse the host bridge. Such a mapping should program
> +	 * the DMA engine with PCI bus addresses.

s/eachother/each other/

> +	 */
> +	PCI_P2PDMA_MAP_BUS_ADDR,
> +
> +	/*
> +	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
> +	 * to eachother, but the transaction traverses a host bridge on the
> +	 * whitelist. In this case, a normal mapping either with CPU physical
> +	 * addresses (in the case of dma-direct) or IOVA addresses (in the
> +	 * case of IOMMUs) should be used to program the DMA engine.

s/eachother/each other/

> +	 */
> +	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> +};
> +
>  #ifdef CONFIG_PCI_P2PDMA
>  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		u64 offset);
> @@ -30,6 +64,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
>  					 unsigned int *nents, u32 length);
>  void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
>  void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
> +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
> +					     struct device *dev);
>  int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  		int nents, enum dma_data_direction dir, unsigned long attrs);
>  void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> @@ -83,6 +119,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
>  static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
>  {
>  }
> +static inline enum pci_p2pdma_map_type
> +pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
> +{
> +	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
> +}
>  static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
>  		struct scatterlist *sg, int nents, enum dma_data_direction dir,
>  		unsigned long attrs)
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
@ 2021-09-27 18:49   ` Bjorn Helgaas
  2021-09-28 19:55   ` Jason Gunthorpe
  2021-09-28 20:05   ` Jason Gunthorpe
  2 siblings, 0 replies; 87+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 18:49 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> a hunk of p2pmem into userspace.
> 
> Pages are allocated from the genalloc in bulk and their reference count
> incremented. They are returned to the genalloc when the page is put.
> 
> The VMA does not take a reference to the pages when they are inserted
> with vmf_insert_mixed() (which is necessary for zone device pages), so
> the backing P2P memory is tracked in a structure stored in vm_private_data.
> 
> A pseudo mount is used to allocate an inode for each PCI device. The
> inode's address_space is used in the file doing the mmap so that all
> VMAs are collected and can be unmapped if the PCI device is unbound.
> After unmapping, the VMAs are iterated through and their pages are
> put so the device can continue to be unbound. An active flag is used
> to signal to VMAs not to allocate any further P2P memory once the
> removal process starts. Concurrent access to the flag is synchronized
> with RCU.
> 
> The VMAs and inode will survive after the unbind of the device, but no
> pages will be present in the VMA and a subsequent access will result
> in a SIGBUS error.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

I would capitalize "Introduce" in the subject line.

> ---
>  drivers/pci/p2pdma.c       | 263 ++++++++++++++++++++++++++++++++++++-
>  include/linux/pci-p2pdma.h |  11 ++
>  include/uapi/linux/magic.h |   1 +
>  3 files changed, 273 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 2422af5a529c..a5adf57af53a 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -16,14 +16,19 @@
>  #include <linux/genalloc.h>
>  #include <linux/memremap.h>
>  #include <linux/percpu-refcount.h>
> +#include <linux/pfn_t.h>
> +#include <linux/pseudo_fs.h>
>  #include <linux/random.h>
>  #include <linux/seq_buf.h>
>  #include <linux/xarray.h>
> +#include <uapi/linux/magic.h>
>  
>  struct pci_p2pdma {
>  	struct gen_pool *pool;
>  	bool p2pmem_published;
>  	struct xarray map_types;
> +	struct inode *inode;
> +	bool active;
>  };
>  
>  struct pci_p2pdma_pagemap {
> @@ -32,6 +37,14 @@ struct pci_p2pdma_pagemap {
>  	u64 bus_offset;
>  };
>  
> +struct pci_p2pdma_map {
> +	struct kref ref;
> +	struct pci_dev *pdev;
> +	struct inode *inode;
> +	void *kaddr;
> +	size_t len;
> +};
> +
>  static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
>  {
>  	return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
> @@ -100,6 +113,26 @@ static const struct attribute_group p2pmem_group = {
>  	.name = "p2pmem",
>  };
>  
> +/*
> + * P2PDMA internal mount
> + * Fake an internal VFS mount-point in order to allocate struct address_space
> + * mappings to remove VMAs on unbind events.
> + */
> +static int pci_p2pdma_fs_cnt;
> +static struct vfsmount *pci_p2pdma_fs_mnt;
> +
> +static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
> +{
> +	return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
> +}
> +
> +static struct file_system_type pci_p2pdma_fs_type = {
> +	.name = "p2dma",
> +	.owner = THIS_MODULE,
> +	.init_fs_context = pci_p2pdma_fs_init_fs_context,
> +	.kill_sb = kill_anon_super,
> +};
> +
>  static void p2pdma_page_free(struct page *page)
>  {
>  	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> @@ -128,6 +161,9 @@ static void pci_p2pdma_release(void *data)
>  	gen_pool_destroy(p2pdma->pool);
>  	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
>  	xa_destroy(&p2pdma->map_types);
> +
> +	iput(p2pdma->inode);
> +	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
>  }
>  
>  static int pci_p2pdma_setup(struct pci_dev *pdev)
> @@ -145,17 +181,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>  	if (!p2p->pool)
>  		goto out;
>  
> -	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
> +	error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
> +			      &pci_p2pdma_fs_cnt);
>  	if (error)
>  		goto out_pool_destroy;
>  
> +	p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
> +	if (IS_ERR(p2p->inode)) {
> +		error = -ENOMEM;
> +		goto out_unpin_fs;
> +	}
> +
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
> +	if (error)
> +		goto out_put_inode;
> +
>  	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
>  	if (error)
> -		goto out_pool_destroy;
> +		goto out_put_inode;
>  
>  	rcu_assign_pointer(pdev->p2pdma, p2p);
>  	return 0;
>  
> +out_put_inode:
> +	iput(p2p->inode);
> +out_unpin_fs:
> +	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
>  out_pool_destroy:
>  	gen_pool_destroy(p2p->pool);
>  out:
> @@ -163,6 +214,45 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>  	return error;
>  }
>  
> +static void pci_p2pdma_map_free_pages(struct pci_p2pdma_map *pmap)
> +{
> +	int i;
> +
> +	if (!pmap->kaddr)
> +		return;
> +
> +	for (i = 0; i < pmap->len; i += PAGE_SIZE)
> +		put_page(virt_to_page(pmap->kaddr + i));
> +
> +	pmap->kaddr = NULL;
> +}
> +
> +static void pci_p2pdma_free_mappings(struct address_space *mapping)
> +{
> +	struct vm_area_struct *vma;
> +
> +	i_mmap_lock_write(mapping);
> +	if (RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> +		goto out;
> +
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, -1)
> +		pci_p2pdma_map_free_pages(vma->vm_private_data);
> +
> +out:
> +	i_mmap_unlock_write(mapping);
> +}
> +
> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> +	struct pci_dev *pdev = data;
> +	struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> +
> +	p2pdma->active = false;
> +	synchronize_rcu();
> +	unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
> +	pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
> +}
> +
>  /**
>   * pci_p2pdma_add_resource - add memory for use as p2p memory
>   * @pdev: the device to add the memory to
> @@ -221,6 +311,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		goto pgmap_free;
>  	}
>  
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
> +					 pdev);
> +	if (error)
> +		goto pages_free;
> +
>  	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>  	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
>  			pci_bus_address(pdev, bar) + offset,
> @@ -229,6 +324,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  	if (error)
>  		goto pages_free;
>  
> +	p2pdma->active = true;
>  	pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
>  		 pgmap->range.start, pgmap->range.end);
>  
> @@ -1029,3 +1125,166 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>  	return sprintf(page, "%s\n", pci_name(p2p_dev));
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> +
> +static struct pci_p2pdma_map *pci_p2pdma_map_alloc(struct pci_dev *pdev,
> +						   size_t len)
> +{
> +	struct pci_p2pdma_map *pmap;
> +
> +	pmap = kzalloc(sizeof(*pmap), GFP_KERNEL);
> +	if (!pmap)
> +		return NULL;
> +
> +	kref_init(&pmap->ref);
> +	pmap->pdev = pci_dev_get(pdev);
> +	pmap->len = len;
> +
> +	return pmap;
> +}
> +
> +static void pci_p2pdma_map_free(struct kref *ref)
> +{
> +	struct pci_p2pdma_map *pmap =
> +		container_of(ref, struct pci_p2pdma_map, ref);
> +
> +	pci_p2pdma_map_free_pages(pmap);
> +	pci_dev_put(pmap->pdev);
> +	iput(pmap->inode);
> +	simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
> +	kfree(pmap);
> +}
> +
> +static void pci_p2pdma_vma_open(struct vm_area_struct *vma)
> +{
> +	struct pci_p2pdma_map *pmap = vma->vm_private_data;
> +
> +	kref_get(&pmap->ref);
> +}
> +
> +static void pci_p2pdma_vma_close(struct vm_area_struct *vma)
> +{
> +	struct pci_p2pdma_map *pmap = vma->vm_private_data;
> +
> +	kref_put(&pmap->ref, pci_p2pdma_map_free);
> +}
> +
> +static vm_fault_t pci_p2pdma_vma_fault(struct vm_fault *vmf)
> +{
> +	struct pci_p2pdma_map *pmap = vmf->vma->vm_private_data;
> +	struct pci_p2pdma *p2pdma;
> +	void *vaddr;
> +	pfn_t pfn;
> +	int i;
> +
> +	if (!pmap->kaddr) {
> +		rcu_read_lock();
> +		p2pdma = rcu_dereference(pmap->pdev->p2pdma);
> +		if (!p2pdma)
> +			goto err_out;
> +
> +		if (!p2pdma->active)
> +			goto err_out;
> +
> +		pmap->kaddr = (void *)gen_pool_alloc(p2pdma->pool, pmap->len);
> +		if (!pmap->kaddr)
> +			goto err_out;
> +
> +		for (i = 0; i < pmap->len; i += PAGE_SIZE)
> +			get_page(virt_to_page(pmap->kaddr + i));
> +
> +		rcu_read_unlock();
> +	}
> +
> +	vaddr = pmap->kaddr + (vmf->pgoff << PAGE_SHIFT);
> +	pfn = phys_to_pfn_t(virt_to_phys(vaddr), PFN_DEV | PFN_MAP);
> +
> +	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> +
> +err_out:
> +	rcu_read_unlock();
> +	return VM_FAULT_SIGBUS;
> +}
> +
> +static const struct vm_operations_struct pci_p2pdma_vmops = {
> +	.open = pci_p2pdma_vma_open,
> +	.close = pci_p2pdma_vma_close,
> +	.fault = pci_p2pdma_vma_fault,
> +};
> +
> +/**
> + * pci_p2pdma_mmap_file_open - setup file mapping to store P2PMEM VMAs
> + * @pdev: the device to allocate memory from
> + * @file: the file being opened against the device
> + *
> + * Set f_mapping of the file to the p2pdma inode so that mappings
> + * can be torn down on device unbind.
> + */
> +void pci_p2pdma_mmap_file_open(struct pci_dev *pdev, struct file *file)
> +{
> +	struct pci_p2pdma *p2pdma;
> +
> +	rcu_read_lock();
> +	p2pdma = rcu_dereference(pdev->p2pdma);
> +	if (p2pdma)
> +		file->f_mapping = p2pdma->inode->i_mapping;
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_mmap_file_open);
> +
> +/**
> + * pci_mmap_p2pmem - setup an mmap region to be backed with P2PDMA memory
> + *	that was registered with pci_p2pdma_add_resource()
> + * @pdev: the device to allocate memory from
> + * @vma: the userspace vma to map the memory to
> + *
> + * The file must call pci_p2pdma_mmap_file_open() in its open() operation.
> + *
> + * Returns 0 on success, or a negative error code on failure
> + */
> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
> +{
> +	struct pci_p2pdma_map *pmap;
> +	struct pci_p2pdma *p2pdma;
> +	int ret;
> +
> +	/* prevent private mappings from being established */
> +	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: fail, attempted private mapping\n",
> +				     current->comm);
> +		return -EINVAL;
> +	}
> +
> +	pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
> +	if (!pmap)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	p2pdma = rcu_dereference(pdev->p2pdma);
> +	if (!p2pdma) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
> +			    &pci_p2pdma_fs_cnt);
> +	if (ret)
> +		goto out;
> +
> +	ihold(p2pdma->inode);
> +	pmap->inode = p2pdma->inode;
> +	rcu_read_unlock();
> +
> +	vma->vm_flags |= VM_MIXEDMAP;
> +	vma->vm_private_data = pmap;
> +	vma->vm_ops = &pci_p2pdma_vmops;
> +
> +	return 0;
> +
> +out:
> +	rcu_read_unlock();
> +	kfree(pmap);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 0c33a40a86e7..f9f19f3db676 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -81,6 +81,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
>  			    bool *use_p2pdma);
>  ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>  			       bool use_p2pdma);
> +void pci_p2pdma_mmap_file_open(struct pci_dev *pdev, struct file *file);
> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
>  #else /* CONFIG_PCI_P2PDMA */
>  static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
>  		size_t size, u64 offset)
> @@ -152,6 +154,15 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
>  {
>  	return sprintf(page, "none\n");
>  }
> +static inline void pci_p2pdma_mmap_file_open(struct pci_dev *pdev,
> +					     struct file *file)
> +{
> +}
> +static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
> +				  struct vm_area_struct *vma)
> +{
> +	return -EOPNOTSUPP;
> +}
>  #endif /* CONFIG_PCI_P2PDMA */
>  
>  
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 35687dcb1a42..af737842c56f 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -88,6 +88,7 @@
>  #define BPF_FS_MAGIC		0xcafe4a11
>  #define AAFS_MAGIC		0x5a3c69f0
>  #define ZONEFS_MAGIC		0x5a4f4653
> +#define P2PDMA_MAGIC		0x70327064
>  
>  /* Since UDF 2.01 is ISO 13346 based... */
>  #define UDF_SUPER_MAGIC		0x15013346
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set
  2021-09-16 23:40 ` [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set Logan Gunthorpe
@ 2021-09-27 18:50   ` Bjorn Helgaas
  0 siblings, 0 replies; 87+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 18:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:42PM -0600, Logan Gunthorpe wrote:
> Attempt to find the mapping type for P2PDMA pages on the first
> DMA map attempt if it has not been done ahead of time.
> 
> Previously, the mapping type was expected to be calculated ahead of
> time, but if pages are to come from userspace then there's no
> way to ensure the path was checked ahead of time.
> 
> With this change it's no longer invalid to call pci_p2pdma_map_sg()
> before the mapping type is calculated, so drop the WARN_ON for that
> case.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Capitalize subject line.

> ---
>  drivers/pci/p2pdma.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 50cdde3e9a8b..1192c465ba6d 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -848,6 +848,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>  	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
>  	struct pci_dev *client;
>  	struct pci_p2pdma *p2pdma;
> +	int dist;
>  
>  	if (!provider->p2pdma)
>  		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
> @@ -864,6 +865,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>  		type = xa_to_value(xa_load(&p2pdma->map_types,
>  					   map_types_idx(client)));
>  	rcu_read_unlock();
> +
> +	if (type == PCI_P2PDMA_MAP_UNKNOWN)
> +		return calc_map_type_and_dist(provider, client, &dist, false);
> +
>  	return type;
>  }
>  
> @@ -906,7 +911,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  	case PCI_P2PDMA_MAP_BUS_ADDR:
>  		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
>  	default:
> -		WARN_ON_ONCE(1);
>  		return 0;
>  	}
>  }
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
@ 2021-09-27 18:50   ` Bjorn Helgaas
  2021-09-28 19:43   ` Jason Gunthorpe
  2021-10-05 22:42   ` Max Gurtovoy
  2 siblings, 0 replies; 87+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 18:50 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:53PM -0600, Logan Gunthorpe wrote:
> This interface is superseded by dma_map_sg(), which now supports
> heterogeneous scatterlists. There are no longer any users, so remove it.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Ditto.

> ---
>  drivers/pci/p2pdma.c       | 65 --------------------------------------
>  include/linux/pci-p2pdma.h | 27 ----------------
>  2 files changed, 92 deletions(-)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 58c34f1f1473..4478633346bd 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -878,71 +878,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>  	return type;
>  }
>  
> -static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
> -		struct device *dev, struct scatterlist *sg, int nents)
> -{
> -	struct scatterlist *s;
> -	int i;
> -
> -	for_each_sg(sg, s, nents, i) {
> -		s->dma_address = sg_phys(s) - p2p_pgmap->bus_offset;
> -		sg_dma_len(s) = s->length;
> -	}
> -
> -	return nents;
> -}
> -
> -/**
> - * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
> - * @dev: device doing the DMA request
> - * @sg: scatter list to map
> - * @nents: elements in the scatterlist
> - * @dir: DMA direction
> - * @attrs: DMA attributes passed to dma_map_sg() (if called)
> - *
> - * Scatterlists mapped with this function should be unmapped using
> - * pci_p2pdma_unmap_sg_attrs().
> - *
> - * Returns the number of SG entries mapped or 0 on error.
> - */
> -int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> -		int nents, enum dma_data_direction dir, unsigned long attrs)
> -{
> -	struct pci_p2pdma_pagemap *p2p_pgmap =
> -		to_p2p_pgmap(sg_page(sg)->pgmap);
> -
> -	switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
> -	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> -		return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
> -	case PCI_P2PDMA_MAP_BUS_ADDR:
> -		return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
> -	default:
> -		return 0;
> -	}
> -}
> -EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
> -
> -/**
> - * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
> - *	mapped with pci_p2pdma_map_sg()
> - * @dev: device doing the DMA request
> - * @sg: scatter list to map
> - * @nents: number of elements returned by pci_p2pdma_map_sg()
> - * @dir: DMA direction
> - * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
> - */
> -void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> -		int nents, enum dma_data_direction dir, unsigned long attrs)
> -{
> -	enum pci_p2pdma_map_type map_type;
> -
> -	map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
> -
> -	if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
> -		dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
> -}
> -EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
> -
>  /**
>   * pci_p2pdma_map_segment - map an sg segment determining the mapping type
>   * @state: State structure that should be declared outside of the for_each_sg()
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index e5a8d5bc0f51..0c33a40a86e7 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -72,10 +72,6 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
>  void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
>  enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
>  					     struct device *dev);
> -int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> -		int nents, enum dma_data_direction dir, unsigned long attrs);
> -void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> -		int nents, enum dma_data_direction dir, unsigned long attrs);
>  enum pci_p2pdma_map_type
>  pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
>  		       struct scatterlist *sg);
> @@ -135,17 +131,6 @@ pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
>  {
>  	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
>  }
> -static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
> -		struct scatterlist *sg, int nents, enum dma_data_direction dir,
> -		unsigned long attrs)
> -{
> -	return 0;
> -}
> -static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
> -		struct scatterlist *sg, int nents, enum dma_data_direction dir,
> -		unsigned long attrs)
> -{
> -}
>  static inline enum pci_p2pdma_map_type
>  pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
>  		       struct scatterlist *sg)
> @@ -181,16 +166,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
>  	return pci_p2pmem_find_many(&client, 1);
>  }
>  
> -static inline int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg,
> -				    int nents, enum dma_data_direction dir)
> -{
> -	return pci_p2pdma_map_sg_attrs(dev, sg, nents, dir, 0);
> -}
> -
> -static inline void pci_p2pdma_unmap_sg(struct device *dev,
> -		struct scatterlist *sg, int nents, enum dma_data_direction dir)
> -{
> -	pci_p2pdma_unmap_sg_attrs(dev, sg, nents, dir, 0);
> -}
> -
>  #endif /* _LINUX_PCI_P2P_H */
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
@ 2021-09-27 18:53   ` Bjorn Helgaas
  2021-09-27 19:59     ` Logan Gunthorpe
  2021-09-28 18:55   ` Jason Gunthorpe
  2021-09-28 22:05   ` [PATCH v3 4/20] " Jason Gunthorpe
  2 siblings, 1 reply; 87+ messages in thread
From: Bjorn Helgaas @ 2021-09-27 18:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
> implementations. It takes a scatterlist segment that must point to a
> pci_p2pdma struct page and will map it if the mapping requires a bus
> address.
> 
> The return value indicates whether the mapping required a bus address
> or whether the caller still needs to map the segment normally. If the
> segment should not be mapped, -EREMOTEIO is returned.
> 
> This helper uses a state structure to track the changes to the
> pgmap across calls and avoid a lookup into the xarray for
> every page.
> 
> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
> dma_map_sg() implementations where the sg segment containing the page
> differs from the sg segment containing the DMA address.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Ditto.

> ---
>  drivers/pci/p2pdma.c       | 59 ++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h | 21 ++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index b656d8c801a7..58c34f1f1473 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>  
> +/**
> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
> + * @state: State structure that should be declared outside of the for_each_sg()
> + *	loop and initialized to zero.
> + * @dev: DMA device that's doing the mapping operation
> + * @sg: scatterlist segment to map
> + *
> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
> + * the sg segment is the same for the page_link and the dma_address.

s/non-iommu/non-IOMMU/

> + *
> + * Attempt to map a single segment in an SGL with the PCI bus address.
> + * The segment must point to a PCI P2PDMA page and thus must be
> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
> + *
> + * Returns the type of mapping used and maps the page if the type is
> + * PCI_P2PDMA_MAP_BUS_ADDR.
> + */
> +enum pci_p2pdma_map_type
> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> +		       struct scatterlist *sg)
> +{
> +	if (state->pgmap != sg_page(sg)->pgmap) {
> +		state->pgmap = sg_page(sg)->pgmap;
> +		state->map = pci_p2pdma_map_type(state->pgmap, dev);
> +		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
> +	}
> +
> +	if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
> +		sg->dma_address = sg_phys(sg) + state->bus_off;
> +		sg_dma_len(sg) = sg->length;
> +		sg_dma_mark_pci_p2pdma(sg);
> +	}
> +
> +	return state->map;
> +}
> +
> +/**
> + * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
> + *	be mapped with PCI_P2PDMA_MAP_BUS_ADDR
> + * @pg_sg: scatterlist segment with the page to map
> + * @dma_sg: scatterlist segment to assign a dma address to

s/dma address/DMA address/, also below

> + *
> + * This is a helper for iommu dma_map_sg() implementations when the
> + * segment for the dma address differs from the segment containing the
> + * source page.
> + *
> + * pci_p2pdma_map_type() must have already been called on the pg_sg and
> + * returned PCI_P2PDMA_MAP_BUS_ADDR.
> + */
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> +				struct scatterlist *dma_sg)
> +{
> +	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
> +
> +	dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
> +	sg_dma_len(dma_sg) = pg_sg->length;
> +	sg_dma_mark_pci_p2pdma(dma_sg);
> +}
> +
>  /**
>   * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
>   *		to enable p2pdma
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index caac2d023f8f..e5a8d5bc0f51 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -13,6 +13,12 @@
>  
>  #include <linux/pci.h>
>  
> +struct pci_p2pdma_map_state {
> +	struct dev_pagemap *pgmap;
> +	int map;
> +	u64 bus_off;
> +};
> +
>  struct block_device;
>  struct scatterlist;
>  
> @@ -70,6 +76,11 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  		int nents, enum dma_data_direction dir, unsigned long attrs);
>  void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  		int nents, enum dma_data_direction dir, unsigned long attrs);
> +enum pci_p2pdma_map_type
> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> +		       struct scatterlist *sg);
> +void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> +				struct scatterlist *dma_sg);
>  int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
>  			    bool *use_p2pdma);
>  ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> @@ -135,6 +146,16 @@ static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
>  		unsigned long attrs)
>  {
>  }
> +static inline enum pci_p2pdma_map_type
> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> +		       struct scatterlist *sg)
> +{
> +	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
> +}
> +static inline void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
> +					      struct scatterlist *dma_sg)
> +{
> +}
>  static inline int pci_p2pdma_enable_store(const char *page,
>  		struct pci_dev **p2p_dev, bool *use_p2pdma)
>  {
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-27 18:53   ` Bjorn Helgaas
@ 2021-09-27 19:59     ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-27 19:59 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-27 12:53 p.m., Bjorn Helgaas wrote:
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> Ditto.

Thanks Bjorn, I'll make these changes and add your Acks for subsequent
postings.

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
@ 2021-09-28 18:32   ` Jason Gunthorpe
  2021-09-29 21:15     ` Logan Gunthorpe
  2021-09-30  4:47   ` Chaitanya Kulkarni
  2021-09-30  4:57   ` Chaitanya Kulkarni
  2 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 18:32 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:41PM -0600, Logan Gunthorpe wrote:
> Make use of the third free LSB in scatterlist's page_link on 64bit systems.
> 
> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
> given SGL segments dma_address points to a PCI bus address.
> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
> segment is marked as P2PDMA.
> 
> Using this bit requires adding an additional dependency on CONFIG_64BIT to
> CONFIG_PCI_P2PDMA. This should be acceptable as the majority of P2PDMA
> use cases are restricted to newer root complexes and typically require the
> extra address space for memory BARs used in the transactions.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>  drivers/pci/Kconfig         |  2 +-
>  include/linux/scatterlist.h | 50 ++++++++++++++++++++++++++++++++++---
>  2 files changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..90b4bddb3300 100644
> +++ b/drivers/pci/Kconfig
> @@ -163,7 +163,7 @@ config PCI_PASID
>  
>  config PCI_P2PDMA
>  	bool "PCI peer-to-peer transfer support"
> -	depends on ZONE_DEVICE
> +	depends on ZONE_DEVICE && 64BIT

Perhaps a comment to explain what the 64bit is doing?

>  	select GENERIC_ALLOCATOR
>  	help
>  	  Enables drivers to do PCI peer-to-peer transactions to and from
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 266754a55327..e62b1cf6386f 100644
> +++ b/include/linux/scatterlist.h
> @@ -64,6 +64,21 @@ struct sg_append_table {
>  #define SG_CHAIN	0x01UL
>  #define SG_END		0x02UL
>  
> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
> + * bus address when doing P2PDMA.
> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
> + */
> +
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_DMA_PCI_P2PDMA	0x04UL

Add a 
	static_assert(__alignof__(void *) == 8);

?

> +#else
> +#define SG_DMA_PCI_P2PDMA	0x00UL
> +#endif
> +
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
> +
>  /*
>   * We overload the LSB of the page pointer to indicate whether it's
>   * a valid sg entry, or whether it points to the start of a new scatterlist.
> @@ -72,7 +87,9 @@ struct sg_append_table {
>  #define sg_is_chain(sg)		((sg)->page_link & SG_CHAIN)
>  #define sg_is_last(sg)		((sg)->page_link & SG_END)
>  #define sg_chain_ptr(sg)	\
> -	((struct scatterlist *) ((sg)->page_link & ~(SG_CHAIN | SG_END)))
> +	((struct scatterlist *)((sg)->page_link & ~SG_PAGE_LINK_MASK))
> +
> +#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)

I've been encouraging people to use static inlines more..

static inline unsigned int __sg_flags(struct scatterlist *sg)
{
	return sg->page_link & SG_PAGE_LINK_MASK;
}
static inline bool sg_is_chain(struct scatterlist *sg)
{
	return __sg_flags(sg) & SG_CHAIN;
}
static inline bool sg_is_last(struct scatterlist *sg)
{
	return __sg_flags(sg) & SG_END;
}
static inline bool sg_is_dma_pci_p2pdma(struct scatterlist *sg)
{
	return __sg_flags(sg) & SG_DMA_PCI_P2PDMA;
}

>  /**
>   * sg_assign_page - Assign a given page to an SG entry
> @@ -86,13 +103,13 @@ struct sg_append_table {
>   **/
>  static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
>  {
> -	unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
> +	unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;

I think this should just be '& SG_END', sg_assign_page() doesn't look
like it should ever be used on a sg_chain entry, so this is just
trying to preserve the end stamp.

Anyway, this looks OK

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static
  2021-09-16 23:40 ` [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static Logan Gunthorpe
  2021-09-27 18:46   ` Bjorn Helgaas
@ 2021-09-28 18:48   ` Jason Gunthorpe
  1 sibling, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 18:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:43PM -0600, Logan Gunthorpe wrote:
> +enum pci_p2pdma_map_type {
> +	/*
> +	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
> +	 * type hasn't been calculated yet. Functions that return this enum
> +	 * never return this value.
> +	 */
> +	PCI_P2PDMA_MAP_UNKNOWN = 0,
> +
> +	/*
> +	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
> +	 * traverse the host bridge and the host bridge is not in the
> +	 * whitelist. DMA Mapping routines should return an error when

I gather we are supposed to type allowlist now

> +	 * this is returned.
> +	 */
> +	PCI_P2PDMA_MAP_NOT_SUPPORTED,
> +
> +	/*
> +	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
> +	 * eachother directly through a PCI switch and the transaction will

'each other'

> +	 * not traverse the host bridge. Such a mapping should program
> +	 * the DMA engine with PCI bus addresses.
> +	 */
> +	PCI_P2PDMA_MAP_BUS_ADDR,
> +
> +	/*
> +	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
> +	 * to eachother, but the transaction traverses a host bridge on the

'each other'

> +	 * whitelist. In this case, a normal mapping either with CPU physical
> +	 * addresses (in the case of dma-direct) or IOVA addresses (in the
> +	 * case of IOMMUs) should be used to program the DMA engine.
> +	 */
> +	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> +};

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
  2021-09-27 18:53   ` Bjorn Helgaas
@ 2021-09-28 18:55   ` Jason Gunthorpe
  2021-09-29 21:26     ` Logan Gunthorpe
  2021-09-28 22:05   ` [PATCH v3 4/20] " Jason Gunthorpe
  2 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 18:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
> implementations. It takes a scatterlist segment that must point to a
> pci_p2pdma struct page and will map it if the mapping requires a bus
> address.
> 
> The return value indicates whether the mapping required a bus address
> or whether the caller still needs to map the segment normally. If the
> segment should not be mapped, -EREMOTEIO is returned.
> 
> This helper uses a state structure to track the changes to the
> pgmap across calls and avoid a lookup into the xarray for
> every page.
> 
> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
> dma_map_sg() implementations where the sg segment containing the page
> differs from the sg segment containing the DMA address.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  drivers/pci/p2pdma.c       | 59 ++++++++++++++++++++++++++++++++++++++
>  include/linux/pci-p2pdma.h | 21 ++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index b656d8c801a7..58c34f1f1473 100644
> +++ b/drivers/pci/p2pdma.c
> @@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>  }
>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>  
> +/**
> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
> + * @state: State structure that should be declared outside of the for_each_sg()
> + *	loop and initialized to zero.
> + * @dev: DMA device that's doing the mapping operation
> + * @sg: scatterlist segment to map
> + *
> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
> + * the sg segment is the same for the page_link and the dma_address.
> + *
> + * Attempt to map a single segment in an SGL with the PCI bus address.
> + * The segment must point to a PCI P2PDMA page and thus must be
> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
> + *
> + * Returns the type of mapping used and maps the page if the type is
> + * PCI_P2PDMA_MAP_BUS_ADDR.
> + */
> +enum pci_p2pdma_map_type
> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> +		       struct scatterlist *sg)
> +{
> +	if (state->pgmap != sg_page(sg)->pgmap) {
> +		state->pgmap = sg_page(sg)->pgmap;
> +		state->map = pci_p2pdma_map_type(state->pgmap, dev);
> +		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
> +	}

Is this safe? I was just talking with Joao about this,

 https://lore.kernel.org/r/20210928180150.GI3544071@ziepe.ca

API wise I absolutely think this should be safe as written, but is it
really?

Does pgmap ensure that a positive refcount struct page always has a
valid pgmap pointer (and thus the mess in gup can be deleted) or does
this need to get the pgmap as well to keep it alive?

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  2021-09-16 23:40 ` [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers Logan Gunthorpe
@ 2021-09-28 18:57   ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 18:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:45PM -0600, Logan Gunthorpe wrote:
> Add EREMOTEIO error return to dma_map_sgtable() which will be used
> by .map_sg() implementations that detect P2PDMA pages that the
> underlying DMA device cannot access.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  kernel/dma/mapping.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  2021-09-16 23:40 ` [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
@ 2021-09-28 19:08   ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:08 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:46PM -0600, Logan Gunthorpe wrote:
> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
> PCI P2PDMA pages directly without a hack in the callers. This allows
> for heterogeneous SGLs that contain both P2PDMA and regular pages.
> 
> A P2PDMA page may have three possible outcomes when being mapped:
>   1) If the data path between the two devices doesn't go through the
>      root port, then it should be mapped with a PCI bus address
>   2) If the data path goes through the host bridge, it should be mapped
>      normally, as though it were a CPU physical address
>   3) It is not possible for the two devices to communicate and thus
>      the mapping operation should fail (and it will return -EREMOTEIO).
> 
> SGL segments that contain PCI bus addresses are marked with
> sg_dma_mark_pci_p2pdma() and are ignored when unmapped.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>  kernel/dma/direct.c | 44 ++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 4c6c5e0635e3..fa8317e8ff44 100644
> +++ b/kernel/dma/direct.c
> @@ -13,6 +13,7 @@
>  #include <linux/vmalloc.h>
>  #include <linux/set_memory.h>
>  #include <linux/slab.h>
> +#include <linux/pci-p2pdma.h>
>  #include "direct.h"
>  
>  /*
> @@ -421,29 +422,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>  		arch_sync_dma_for_cpu_all();
>  }
>  
> +/*
> + * Unmaps segments, except for ones marked as pci_p2pdma which do not
> + * require any further action as they contain a bus address.
> + */
>  void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>  		int nents, enum dma_data_direction dir, unsigned long attrs)
>  {
>  	struct scatterlist *sg;
>  	int i;
>  
> -	for_each_sg(sgl, sg, nents, i)
> -		dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
> -			     attrs);
> +	for_each_sg(sgl, sg, nents, i) {
> +		if (sg_is_dma_pci_p2pdma(sg)) {
> +			sg_dma_unmark_pci_p2pdma(sg);
> +		} else  {
> +			dma_direct_unmap_page(dev, sg->dma_address,
> +					      sg_dma_len(sg), dir, attrs);
> +		}

If the main usage of this SGL bit is to indicate if it has been DMA
mapped, or not, I think it should be renamed to something clearer.

p2pdma is being used for lots of things now, it feels very
counter-intuitive that P2PDMA pages are not flagged with
something called sg_is_dma_pci_p2pdma().

How about sg_is_dma_unmapped_address() ?
>  
>  	for_each_sg(sgl, sg, nents, i) {
> +		if (is_pci_p2pdma_page(sg_page(sg))) {
> +			map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
> +			switch (map) {
> +			case PCI_P2PDMA_MAP_BUS_ADDR:
> +				continue;
> +			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +				/*
> +				 * Mapping through host bridge should be
> +				 * mapped normally, thus we do nothing
> +				 * and continue below.
> +				 */
> +				break;
> +			default:
> +				ret = -EREMOTEIO;
> +				goto out_unmap;
> +			}
> +		}
> +
>  		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
>  				sg->offset, sg->length, dir, attrs);

dma_direct_map_page() can trigger swiotlb and I didn't see this series
dealing with that?

It would probably be fine for now to fail swiotlb_map() for p2p pages?

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  2021-09-16 23:40 ` [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
@ 2021-09-28 19:11   ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:11 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:47PM -0600, Logan Gunthorpe wrote:
> Add a flags member to the dma_map_ops structure with one flag to
> indicate support for PCI P2PDMA.
> 
> Also, add a helper to check if a device supports PCI P2PDMA.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  include/linux/dma-map-ops.h | 10 ++++++++++
>  include/linux/dma-mapping.h |  5 +++++
>  kernel/dma/mapping.c        | 18 ++++++++++++++++++
>  3 files changed, 33 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  2021-09-16 23:40 ` [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
@ 2021-09-28 19:15   ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:48PM -0600, Logan Gunthorpe wrote:
> When a PCI P2PDMA page is seen, set the IOVA length of the segment
> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
> apply the appropriate bus address to the segment. The IOVA is not
> created if the scatterlist only consists of P2PDMA pages.
> 
> A P2PDMA page may have three possible outcomes when being mapped:
>   1) If the data path between the two devices doesn't go through
>      the root port, then it should be mapped with a PCI bus address
>   2) If the data path goes through the host bridge, it should be mapped
>      normally with an IOMMU IOVA.
>   3) It is not possible for the two devices to communicate and thus
>      the mapping operation should fail (and it will return -EREMOTEIO).
> 
> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
> indicate bus address segments. On unmap, P2PDMA segments are skipped
> over when determining the start and end IOVA addresses.
> 
> With this change, the flags variable in the dma_map_ops is set to
> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/iommu/dma-iommu.c | 68 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 61 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  2021-09-16 23:40 ` [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported() Logan Gunthorpe
@ 2021-09-28 19:17   ` Jason Gunthorpe
  2021-10-05 22:31   ` Max Gurtovoy
  1 sibling, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:51PM -0600, Logan Gunthorpe wrote:
> Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
> if a given ib_device can be used in P2PDMA transfers. This ensures
> the ib_device is not using virt_dma and also that the underlying
> dma_device supports P2PDMA.
> 
> Use the new helper in nvme-rdma to replace the existing check for
> ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
> switching away from pci_p2pdma_[un]map_sg().
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/nvme/target/rdma.c |  2 +-
>  include/rdma/ib_verbs.h    | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

> +/*
> + * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
> + */
> +static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
> +{
> +	if (ib_uses_virt_dma(dev))
> +		return false;

If someone wants to make rxe/hfi/qib use this stuff then they will
have to teach the driver to do all the p2p checks and add some
struct ib_device flag

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()
  2021-09-16 23:40 ` [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable() Logan Gunthorpe
@ 2021-09-28 19:43   ` Jason Gunthorpe
  2021-09-29 22:56     ` Logan Gunthorpe
  2021-10-05 22:40     ` Max Gurtovoy
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:43 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:52PM -0600, Logan Gunthorpe wrote:
> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
> is no longer necessary and may be dropped.
> 
> Switch to the dma_map_sgtable() interface which will allow for better
> error reporting if the P2PDMA pages are unsupported.
> 
> The change to sgtable also appears to fix a couple subtle error path
> bugs:
> 
>   - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
>     that could have been incremented from the original call, as
>     well as an nents that was not the original number of nents
>     called when mapped.
>   - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
>     were unmapped with the incorrect number of nents.

Those bugs should definitely get fixed.. I might extract the sgtable
conversion into a stand alone patch to do it.

But as it is, this looks fine

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
  2021-09-27 18:50   ` Bjorn Helgaas
@ 2021-09-28 19:43   ` Jason Gunthorpe
  2021-10-05 22:42   ` Max Gurtovoy
  2 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:43 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:53PM -0600, Logan Gunthorpe wrote:
> This interface is superseded by support in dma_map_sg() which now supports
> heterogeneous scatterlists. There are no longer any users, so remove it.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c       | 65 --------------------------------------
>  include/linux/pci-p2pdma.h | 27 ----------------
>  2 files changed, 92 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Good riddance :)

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2021-09-16 23:40 ` [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2021-09-28 19:47   ` Jason Gunthorpe
  2021-09-29 21:34     ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:47 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote:
> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> allow obtaining P2PDMA pages. If a caller does not set this flag
> and tries to map P2PDMA pages it will fail.
> 
> This is implemented by adding a flag and a check to get_dev_pagemap().

I would like to see the get_dev_pagemap() deleted from GUP in the
first place.

Why isn't this just a simple check of the page->pgmap type after
acquiring a valid page reference? See my prior note

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
  2021-09-27 18:49   ` Bjorn Helgaas
@ 2021-09-28 19:55   ` Jason Gunthorpe
  2021-09-29 21:42     ` Logan Gunthorpe
  2021-09-28 20:05   ` Jason Gunthorpe
  2 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 19:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
> +{
> +	struct pci_p2pdma_map *pmap;
> +	struct pci_p2pdma *p2pdma;
> +	int ret;
> +
> +	/* prevent private mappings from being established */
> +	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: fail, attempted private mapping\n",
> +				     current->comm);
> +		return -EINVAL;
> +	}
> +
> +	pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
> +	if (!pmap)
> +		return -ENOMEM;
> +
> +	rcu_read_lock();
> +	p2pdma = rcu_dereference(pdev->p2pdma);
> +	if (!p2pdma) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
> +			    &pci_p2pdma_fs_cnt);
> +	if (ret)
> +		goto out;
> +
> +	ihold(p2pdma->inode);
> +	pmap->inode = p2pdma->inode;
> +	rcu_read_unlock();
> +
> +	vma->vm_flags |= VM_MIXEDMAP;

Why is this a VM_MIXEDMAP? Everything the fault handler sticks in here
has a struct page, right?

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (19 preceding siblings ...)
  2021-09-16 23:41 ` [PATCH v3 20/20] nvme-pci: allow mmaping the CMB in userspace Logan Gunthorpe
@ 2021-09-28 20:02 ` Jason Gunthorpe
  2021-09-29 21:50   ` Logan Gunthorpe
  20 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 20:02 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
> Hi,
> 
> This patchset continues my work to add userspace P2PDMA access using
> O_DIRECT NVMe devices. My last posting[1] just included the first 13
> patches in this series, but the early P2PDMA cleanup and map_sg error
> changes from that series have been merged into v5.15-rc1. To address
> concerns that that series did not add any new functionality, I've added
> back the userspace functionality from the original RFC[2] (but improved
> based on the original feedback).

I really think this is the best series yet, it really looks nice
overall. I know the sg flag was a bit of a debate at the start, but it
serves an undeniable purpose and the resulting standard DMA APIs 'just
working' is really clean.

There is more possible here, we could also pass the new GUP flag in the
ib_umem code..

After this gets merged I would make a series to split out the CMB
genalloc related stuff and try and probably get something like VFIO to
export this kind of memory as well, then it would have pretty nice
coverage.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
  2021-09-27 18:49   ` Bjorn Helgaas
  2021-09-28 19:55   ` Jason Gunthorpe
@ 2021-09-28 20:05   ` Jason Gunthorpe
  2021-09-29 21:46     ` Logan Gunthorpe
  2 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 20:05 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:

> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> +	struct pci_dev *pdev = data;
> +	struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> +
> +	p2pdma->active = false;
> +	synchronize_rcu();
> +	unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
> +	pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
> +}

If this is going to rely on unmap_mapping_range then GUP should also
reject this memory for FOLL_LONGTERM..

What along this control flow:

> +       error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
> +                                        pdev);

Waits for all the page refcounts to go to zero?

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
  2021-09-27 18:53   ` Bjorn Helgaas
  2021-09-28 18:55   ` Jason Gunthorpe
@ 2021-09-28 22:05   ` Jason Gunthorpe
  2021-09-29 21:30     ` Logan Gunthorpe
  2 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 22:05 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:

> +enum pci_p2pdma_map_type
> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> +		       struct scatterlist *sg)
> +{
> +	if (state->pgmap != sg_page(sg)->pgmap) {
> +		state->pgmap = sg_page(sg)->pgmap;

This has built into it an assumption that every page in the sg element
has the same pgmap, but AFAIK nothing enforces this rule? There is no
requirement that the HW has pfn gaps between the pgmaps linux decides
to create over it.

At least sg_alloc_append_table_from_pages() and probably something in
the block world should be updated to not combine struct pages with
different pgmaps, and this should be documented in scatterlist.*
someplace.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-28 18:32   ` Jason Gunthorpe
@ 2021-09-29 21:15     ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-28 12:32 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:41PM -0600, Logan Gunthorpe wrote:
>>  config PCI_P2PDMA
>>  	bool "PCI peer-to-peer transfer support"
>> -	depends on ZONE_DEVICE
>> +	depends on ZONE_DEVICE && 64BIT
> 
> Perhaps a comment to explain what the 64bit is doing?

Added.

>>  	select GENERIC_ALLOCATOR
>>  	help
>>  	  Enableѕ drivers to do PCI peer-to-peer transactions to and from
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 266754a55327..e62b1cf6386f 100644
>> +++ b/include/linux/scatterlist.h
>> @@ -64,6 +64,21 @@ struct sg_append_table {
>>  #define SG_CHAIN	0x01UL
>>  #define SG_END		0x02UL
>>  
>> +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
>> + * bus address when doing P2PDMA.
>> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
>> + */
>> +
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_PCI_P2PDMA	0x04UL
> 
> Add a 
> 	static_assert(__alignof__(void *) == 8);
> 
> ?

Good idea. Though, I think your line isn't quite correct. I've added:

static_assert(__alignof__(struct page) >= 8);

>> +#define sg_is_dma_pci_p2pdma(sg) ((sg)->page_link & SG_DMA_PCI_P2PDMA)
> 
> I've been encouraging people to use static inlines more..

I also prefer static inlines, but I usually follow the style of the
code I'm changing. In any case, I've changed these to static inlines
similar to your example.

>>  /**
>>   * sg_assign_page - Assign a given page to an SG entry
>> @@ -86,13 +103,13 @@ struct sg_append_table {
>>   **/
>>  static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
>>  {
>> -	unsigned long page_link = sg->page_link & (SG_CHAIN | SG_END);
>> +	unsigned long page_link = sg->page_link & SG_PAGE_LINK_MASK;
> 
> I think this should just be '& SG_END', sg_assign_page() doesn't look
> like it should ever be used on a sg_chain entry, so this is just
> trying to preserve the end stamp.

Perhaps, but I'm not comfortable making that change in this patch or
series. Though, I've reverted this specific change in my patch so
sg_assign_page() will clear SG_DMA_PCI_P2PDMA.

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-28 18:55   ` Jason Gunthorpe
@ 2021-09-29 21:26     ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-28 12:55 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
>> Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
>> implementations. It takes a scatterlist segment that must point to a
>> pci_p2pdma struct page and will map it if the mapping requires a bus
>> address.
>>
>> The return value indicates whether the mapping required a bus address
>> or whether the caller still needs to map the segment normally. If the
>> segment should not be mapped, -EREMOTEIO is returned.
>>
>> This helper uses a state structure to track the changes to the
>> pgmap across calls and avoid needing to lookup into the xarray for
>> every page.
>>
>> Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
>> dma_map_sg() implementations where the sg segment containing the page
>> differs from the sg segment containing the DMA address.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>>  drivers/pci/p2pdma.c       | 59 ++++++++++++++++++++++++++++++++++++++
>>  include/linux/pci-p2pdma.h | 21 ++++++++++++++
>>  2 files changed, 80 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index b656d8c801a7..58c34f1f1473 100644
>> +++ b/drivers/pci/p2pdma.c
>> @@ -943,6 +943,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>>  }
>>  EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>  
>> +/**
>> + * pci_p2pdma_map_segment - map an sg segment determining the mapping type
>> + * @state: State structure that should be declared outside of the for_each_sg()
>> + *	loop and initialized to zero.
>> + * @dev: DMA device that's doing the mapping operation
>> + * @sg: scatterlist segment to map
>> + *
>> + * This is a helper to be used by non-iommu dma_map_sg() implementations where
>> + * the sg segment is the same for the page_link and the dma_address.
>> + *
>> + * Attempt to map a single segment in an SGL with the PCI bus address.
>> + * The segment must point to a PCI P2PDMA page and thus must be
>> + * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
>> + *
>> + * Returns the type of mapping used and maps the page if the type is
>> + * PCI_P2PDMA_MAP_BUS_ADDR.
>> + */
>> +enum pci_p2pdma_map_type
>> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
>> +		       struct scatterlist *sg)
>> +{
>> +	if (state->pgmap != sg_page(sg)->pgmap) {
>> +		state->pgmap = sg_page(sg)->pgmap;
>> +		state->map = pci_p2pdma_map_type(state->pgmap, dev);
>> +		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
>> +	}
> 
> Is this safe? I was just talking with Joao about this,
> 
>  https://lore.kernel.org/r/20210928180150.GI3544071@ziepe.ca
> 

I agree that taking the extra reference on the pagemap seems
unnecessary. However, it was convenient for the purposes of this
patchset to check the page type only on each new pgmap rather than for
every page. But if we need to add a check directly to gup, that'd
probably be fine too.

> API wise I absolutely think this should be safe as written, but is it
> really?
> 
> Does pgmap ensure that a positive refcount struct page always has a
> valid pgmap pointer (and thus the mess in gup can be deleted) or does
> this need to get the pgmap as well to keep it alive?

Yes, the P2PDMA code ensures that the pgmap will not be deleted if there
is a positive refcount on any struct page.

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-28 22:05   ` [PATCH v3 4/20] " Jason Gunthorpe
@ 2021-09-29 21:30     ` Logan Gunthorpe
  2021-09-29 22:46       ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> 
>> +enum pci_p2pdma_map_type
>> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
>> +		       struct scatterlist *sg)
>> +{
>> +	if (state->pgmap != sg_page(sg)->pgmap) {
>> +		state->pgmap = sg_page(sg)->pgmap;
> 
> This has built into it an assumption that every page in the sg element
> has the same pgmap, but AFAIK nothing enforces this rule? There is no
> requirement that the HW has pfn gaps between the pgmaps linux decides
> to create over it.

No, that's not a correct reading of the code. Every time there is a new
pagemap, this code calculates the mapping type and bus offset. If a page
comes along with a different pgmap, it recalculates. This just
reduces the overhead so that the calculation is done only every time a
page with a different pgmap comes along and not doing it for every
single page.
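The caching pattern described above can be sketched as a small userspace model (illustrative names only, not kernel code; `struct model_state` and `model_map_segment()` are stand-ins for `struct pci_p2pdma_map_state` and `pci_p2pdma_map_segment()`):

```c
#include <assert.h>
#include <stddef.h>

struct model_pgmap { int map_type; long bus_offset; };

struct model_state {
	const struct model_pgmap *pgmap; /* pgmap seen on the last entry */
	int map_type;
	long bus_offset;
	int recalcs;                     /* how often we had to recompute */
};

/* Recompute the mapping type and bus offset only when the entry's
 * pgmap differs from the cached one; otherwise reuse the cache. */
static int model_map_segment(struct model_state *state,
			     const struct model_pgmap *pgmap)
{
	if (state->pgmap != pgmap) {
		state->pgmap = pgmap;
		state->map_type = pgmap->map_type;
		state->bus_offset = pgmap->bus_offset;
		state->recalcs++;
	}
	return state->map_type;
}
```

A run over the pgmap sequence A, A, B, A recomputes three times (A, B, A), not four: the cost is paid per pgmap transition, not per page.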

> At least sg_alloc_append_table_from_pages() and probably something in
> the block world should be updated to not combine struct pages with
> different pgmaps, and this should be documented in scatterlist.*
> someplace.

There's no sane place to do this check. The code is designed to support
mappings with different pgmaps.

Logan


* Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2021-09-28 19:47   ` Jason Gunthorpe
@ 2021-09-29 21:34     ` Logan Gunthorpe
  2021-09-29 22:48       ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote:
>> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
>> allow obtaining P2PDMA pages. If a caller does not set this flag
>> and tries to map P2PDMA pages it will fail.
>>
>> This is implemented by adding a flag and a check to get_dev_pagemap().
> 
> I would like to see the get_dev_pagemap() deleted from GUP in the
> first place.
> 
> Why isn't this just a simple check of the page->pgmap type after
> acquiring a valid page reference? See my prior note

It could be, but that will mean dereferencing the pgmap for every page
to determine the type of page and then comparing with FOLL_PCI_P2PDMA.

Probably not terrible to go this way.
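The alternative being weighed here can be sketched as a userspace model (not kernel code; `MODEL_FOLL_P2PDMA`, the enum, and the function are illustrative stand-ins for the proposed FOLL_PCI_P2PDMA gate):

```c
#include <assert.h>

enum model_page_kind { MODEL_NORMAL, MODEL_ZONE_DEVICE_P2PDMA };

#define MODEL_FOLL_P2PDMA 0x1u

/* After a valid page reference is taken, inspect the page's kind and
 * reject P2PDMA pages unless the caller opted in with the flag. */
static int model_gup_accepts(enum model_page_kind kind,
			     unsigned int foll_flags)
{
	if (kind == MODEL_ZONE_DEVICE_P2PDMA &&
	    !(foll_flags & MODEL_FOLL_P2PDMA))
		return 0;	/* caller did not opt in: fail the pin */
	return 1;
}
```

Normal pages always pass; P2PDMA pages pass only with the flag set, which is the behavior the patch implements via get_dev_pagemap() today.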

Logan


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-28 19:55   ` Jason Gunthorpe
@ 2021-09-29 21:42     ` Logan Gunthorpe
  2021-09-29 23:05       ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-28 1:55 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
>> +int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
>> +{
>> +	struct pci_p2pdma_map *pmap;
>> +	struct pci_p2pdma *p2pdma;
>> +	int ret;
>> +
>> +	/* prevent private mappings from being established */
>> +	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
>> +		pci_info_ratelimited(pdev,
>> +				     "%s: fail, attempted private mapping\n",
>> +				     current->comm);
>> +		return -EINVAL;
>> +	}
>> +
>> +	pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
>> +	if (!pmap)
>> +		return -ENOMEM;
>> +
>> +	rcu_read_lock();
>> +	p2pdma = rcu_dereference(pdev->p2pdma);
>> +	if (!p2pdma) {
>> +		ret = -ENODEV;
>> +		goto out;
>> +	}
>> +
>> +	ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
>> +			    &pci_p2pdma_fs_cnt);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ihold(p2pdma->inode);
>> +	pmap->inode = p2pdma->inode;
>> +	rcu_read_unlock();
>> +
>> +	vma->vm_flags |= VM_MIXEDMAP;
> 
> Why is this a VM_MIXEDMAP? Everything fault sticks in here has a
> struct page, right?

Yes. This decision is not so simple; I tried a few variations before
settling on this.

The main reason is probably this: if we don't use VM_MIXEDMAP, then we
can't set pte_devmap(). If we don't set pte_devmap(), then every single
page that GUP processes needs to check if it's a ZONE_DEVICE page and
also if it's a P2PDMA page (thus dereferencing pgmap) in order to
satisfy the requirements of FOLL_PCI_P2PDMA.

I didn't think other developers would go for that kind of performance
hit. With VM_MIXEDMAP we hide the performance penalty behind the
existing check. And with the current pgmap code as is, we only need to
do that check once for every new pgmap pointer.

Logan


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-28 20:05   ` Jason Gunthorpe
@ 2021-09-29 21:46     ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-28 2:05 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:59PM -0600, Logan Gunthorpe wrote:
> 
>> +static void pci_p2pdma_unmap_mappings(void *data)
>> +{
>> +	struct pci_dev *pdev = data;
>> +	struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>> +
>> +	p2pdma->active = false;
>> +	synchronize_rcu();
>> +	unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>> +	pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
>> +}
> 
> If this is going to rely on unmap_mapping_range then GUP should also
> reject this memory for FOLL_LONGTERM..

Right, makes sense.

> 
> What along this control flow:
> 
>> +       error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
>> +                                        pdev);
> 
> Waits for all the page refcounts to go to zero?

That's already in the existing code as part of memunmap_pages() which
puts the original reference to all the pages and then waits for the
reference to go to zero.

This new action unmaps all the VMAs so that the subsequent call to
memunmap_pages() doesn't block on userspace processes.

Logan


* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-28 20:02 ` [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Jason Gunthorpe
@ 2021-09-29 21:50   ` Logan Gunthorpe
  2021-09-29 23:21     ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 21:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>> Hi,
>>
>> This patchset continues my work to add userspace P2PDMA access using
>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>> patches in this series, but the early P2PDMA cleanup and map_sg error
>> changes from that series have been merged into v5.15-rc1. To address
>> concerns that that series did not add any new functionality, I've added
>> back the userspace functionality from the original RFC[2] (but improved
>> based on the original feedback).
> 
> I really think this is the best series yet, it really looks nice
> overall. I know the sg flag was a bit of a debate at the start, but it
> serves an undeniable purpose and the resulting standard DMA APIs 'just
> working' is really clean.

Actually, so far, nobody has said anything negative about using the SG flag.

> There is more possible here, we could also pass the new GUP flag in the
> ib_umem code..

Yes, that would be very useful.

> After this gets merged I would make a series to split out the CMD
> genalloc related stuff and try and probably get something like VFIO to
> export this kind of memory as well, then it would have pretty nice
> coverage.

Yup!

Thanks for the review. Anything I didn't respond to I've either made
changes for, or am still working on and will be addressed for subsequent
postings.

Logan


* Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-29 21:30     ` Logan Gunthorpe
@ 2021-09-29 22:46       ` Jason Gunthorpe
  2021-09-29 23:00         ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 22:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:44PM -0600, Logan Gunthorpe wrote:
> > 
> >> +enum pci_p2pdma_map_type
> >> +pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
> >> +		       struct scatterlist *sg)
> >> +{
> >> +	if (state->pgmap != sg_page(sg)->pgmap) {
> >> +		state->pgmap = sg_page(sg)->pgmap;
> > 
> > This has built into it an assumption that every page in the sg element
> > has the same pgmap, but AFAIK nothing enforces this rule? There is no
> > requirement that the HW has pfn gaps between the pgmaps linux decides
> > to create over it.
> 
> No, that's not a correct reading of the code. Every time there is a new
> pagemap, this code calculates the mapping type and bus offset. If a page
> comes along with a different pgmap, it recalculates. This just
> reduces the overhead so that the calculation is done only every time a
> page with a different pgmap comes along and not doing it for every
> single page.

Each 'struct scatterlist *sg' refers to a range of contiguous pfns
starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
pfns long.

sg_page() returns the first page, but nothing says that sg_page()+1
has the same pgmap.

The code in this patch does check the first page of each sg in a
larger sgl.

> > At least sg_alloc_append_table_from_pages() and probably something in
> > the block world should be updated to not combine struct pages with
> > different pgmaps, and this should be documented in scatterlist.*
> > someplace.
> 
> There's no sane place to do this check. The code is designed to support
> mappings with different pgmaps.

All places that generate compound sg's by aggregating multiple pages
need to include this check along side the check for physical
contiguity. There are not that many places but
sg_alloc_append_table_from_pages() is one of them:

@@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
 
                /* Merge contiguous pages into the last SG */
                prv_len = sgt_append->prv->length;
-               while (n_pages && page_to_pfn(pages[0]) == paddr) {
+               while (n_pages && page_to_pfn(pages[0]) == paddr &&
+                      sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {
                        if (sgt_append->prv->length + PAGE_SIZE > max_segment)
                                break;
                        sgt_append->prv->length += PAGE_SIZE;
@@ -488,7 +489,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
        for (i = 1; i < n_pages; i++) {
                seg_len += PAGE_SIZE;
                if (seg_len >= max_segment ||
-                   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+                   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1 ||
+                   pages[i]->pgmap != pages[i - 1]->pgmap) {
                        chunks++;
                        seg_len = 0;
                }
@@ -505,9 +507,10 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
                        seg_len += PAGE_SIZE;
                        if (seg_len >= max_segment ||
                            page_to_pfn(pages[j]) !=
-                           page_to_pfn(pages[j - 1]) + 1)
+                                   page_to_pfn(pages[j - 1]) + 1 ||
+                           pages[i]->pgmap != pages[i - 1]->pgmap) {
                                break;
-               }
+                       }
 
                /* Pass how many chunks might be left */
                s = get_next_sg(sgt_append, s, chunks - i + left_pages,
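The merge rule in the diff above boils down to one predicate, sketched here as a userspace model (illustrative names; `struct model_page` and `can_merge()` are not kernel identifiers):

```c
#include <assert.h>

struct model_page {
	unsigned long pfn;
	const void *pgmap;	/* identity of the page's dev_pagemap */
};

/* Two pages may be folded into one scatterlist segment only if their
 * pfns are consecutive AND they come from the same pgmap. */
static int can_merge(const struct model_page *prev,
		     const struct model_page *next)
{
	return next->pfn == prev->pfn + 1 &&
	       next->pgmap == prev->pgmap;
}
```

A pair of contiguous pfns that straddles a pgmap boundary must start a new segment, which is exactly the case the existing physical-contiguity check alone would miss.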




* Re: [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2021-09-29 21:34     ` Logan Gunthorpe
@ 2021-09-29 22:48       ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 22:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 03:34:22PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-28 1:47 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:54PM -0600, Logan Gunthorpe wrote:
> >> Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> >> allow obtaining P2PDMA pages. If a caller does not set this flag
> >> and tries to map P2PDMA pages it will fail.
> >>
> >> This is implemented by adding a flag and a check to get_dev_pagemap().
> > 
> > I would like to see the get_dev_pagemap() deleted from GUP in the
> > first place.
> > 
> > Why isn't this just a simple check of the page->pgmap type after
> > acquiring a valid page reference? See my prior note
> 
> It could be, but that will mean dereferencing the pgmap for every page
> to determine the type of page and then comparing with FOLL_PCI_P2PDMA.

It would be done under the pte devmap test and this is less expensive
than the xarray search.

Jason


* Re: [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()
  2021-09-28 19:43   ` Jason Gunthorpe
@ 2021-09-29 22:56     ` Logan Gunthorpe
  2021-10-05 22:40     ` Max Gurtovoy
  1 sibling, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-28 1:43 p.m., Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:52PM -0600, Logan Gunthorpe wrote:
>> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
>> is no longer necessary and may be dropped.
>>
>> Switch to the dma_map_sgtable() interface which will allow for better
>> error reporting if the P2PDMA pages are unsupported.
>>
>> The change to sgtable also appears to fix a couple subtle error path
>> bugs:
>>
>>   - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
>>     that could have been incremented from the original call, as
>>     well as an nents that was not the original number of nents
>>     called when mapped.
>>   - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
>>     were unmapped with the incorrect number of nents.
> 
> Those bugs should definately get fixed.. I might extract the sgtable
> conversion into a stand alone patch to do it.

Yes. I can try to split it off myself and send a patch later this week.

Logan


* Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-29 22:46       ` Jason Gunthorpe
@ 2021-09-29 23:00         ` Logan Gunthorpe
  2021-09-29 23:40           ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 23:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
>> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
>> No, that's not a correct reading of the code. Every time there is a new
>> pagemap, this code calculates the mapping type and bus offset. If a page
>> comes along with a different pgmap, it recalculates. This just
>> reduces the overhead so that the calculation is done only every time a
>> page with a different pgmap comes along and not doing it for every
>> single page.
> 
> Each 'struct scatterlist *sg' refers to a range of contiguous pfns
> starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
> pfns long.
> 

Ugh, right. A bit contrived for consecutive pages to have different
pgmaps and still be next to each other in a DMA transaction. But I guess
it is technically possible and should be protected against.

> sg_page() returns the first page, but nothing says that sg_page()+1
> has the same pgmap.
> 
> The code in this patch does check the first page of each sg in a
> larger sgl.
> 
>>> At least sg_alloc_append_table_from_pages() and probably something in
>>> the block world should be updated to not combine struct pages with
>>> different pgmaps, and this should be documented in scatterlist.*
>>> someplace.
>>
>> There's no sane place to do this check. The code is designed to support
>> mappings with different pgmaps.
> 
> All places that generate compound sg's by aggregating multiple pages
> need to include this check along side the check for physical
> contiguity. There are not that many places but
> sg_alloc_append_table_from_pages() is one of them:

Yes. The block layer also does this. I believe a check in
page_is_mergeable() will be sufficient there.

> @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
>  
>                 /* Merge contiguous pages into the last SG */
>                 prv_len = sgt_append->prv->length;
> -               while (n_pages && page_to_pfn(pages[0]) == paddr) {
> +               while (n_pages && page_to_pfn(pages[0]) == paddr &&
> +                      sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {

I don't think it's correct to use pgmap without first checking if it is
a zone device page. But your point is taken. I'll try to address this.

Logan


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-29 21:42     ` Logan Gunthorpe
@ 2021-09-29 23:05       ` Jason Gunthorpe
  2021-09-29 23:27         ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:05 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote:

> The main reason is probably this: if we don't use VM_MIXEDMAP, then we
> can't set pte_devmap(). 

I think that is an API limitation in the fault routines..

finish_fault() should set the pte_devmap - eg by passing the
PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
otherwise signaling do_set_pte() that it should set those PTE bits
when it creates the entry.

(or there should be a vmf_* helper for this special case, but using
the vmf->page seems righter to me)

> If we don't set pte_devmap(), then every single page that GUP
> processes needs to check if it's a ZONE_DEVICE page and also if it's
> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
> requirements of FOLL_PCI_P2PDMA.

Definitely not suggesting not to set pte_devmap(), only that
VM_MIXEDMAP should not be set on VMAs that only contain struct
pages. That is an abuse of what it is intended for.

At the very least there should be a big comment above the usage
explaining that this is just working around a limitation in
finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.

Jason



* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-29 21:50   ` Logan Gunthorpe
@ 2021-09-29 23:21     ` Jason Gunthorpe
  2021-09-29 23:28       ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:21 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> > On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
> >> Hi,
> >>
> >> This patchset continues my work to add userspace P2PDMA access using
> >> O_DIRECT NVMe devices. My last posting[1] just included the first 13
> >> patches in this series, but the early P2PDMA cleanup and map_sg error
> >> changes from that series have been merged into v5.15-rc1. To address
> >> concerns that that series did not add any new functionality, I've added
> >> back the userspace functionality from the original RFC[2] (but improved
> >> based on the original feedback).
> > 
> > I really think this is the best series yet, it really looks nice
> > overall. I know the sg flag was a bit of a debate at the start, but it
> > serves an undeniable purpose and the resulting standard DMA APIs 'just
> > working' is really clean.
> 
> Actually, so far, nobody has said anything negative about using the SG flag.
> 
> > There is more possible here, we could also pass the new GUP flag in the
> > ib_umem code..
> 
> Yes, that would be very useful.

You might actually prefer to do that rather than the bio changes to get the
infrastructure merged, as it seems less "core"

Jason


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-29 23:05       ` Jason Gunthorpe
@ 2021-09-29 23:27         ` Logan Gunthorpe
  2021-09-29 23:35           ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 23:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni




On 2021-09-29 5:05 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:42:00PM -0600, Logan Gunthorpe wrote:
> 
>> The main reason is probably this: if we don't use VM_MIXEDMAP, then we
>> can't set pte_devmap(). 
> 
> I think that is an API limitation in the fault routines..
> 
> finish_fault() should set the pte_devmap - eg by passing the
> PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
> otherwise signaling do_set_pte() that it should set those PTE bits
> when it creates the entry.
> 
> (or there should be a vmf_* helper for this special case, but using
> the vmf->page seems righter to me)

I'm not opposed to this. Though I'm not sure what's best here.

>> If we don't set pte_devmap(), then every single page that GUP
>> processes needs to check if it's a ZONE_DEVICE page and also if it's
>> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
>> requirements of FOLL_PCI_P2PDMA.
> 
> Definately not suggesting not to set pte_devmap(), only that
> VM_MIXEDMAP should not be set on VMAs that only contain struct
> pages. That is an abuse of what it is intended for.
> 
> At the very least there should be a big comment above the usage
> explaining that this is just working around a limitation in
> finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.

Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
and the intention is not clear. I got the impression that mm people
wanted those interfaces used for users of pte_devmap().

device-dax uses these interfaces and as far as I can see it also only
contains struct pages (or at least dev_dax_huge_fault() calls
pfn_to_page() on every page when VM_FAULT_NOPAGE happens).

So it would be nice to get some direction here from mm developers on
what they'd prefer.

Logan


* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-29 23:21     ` Jason Gunthorpe
@ 2021-09-29 23:28       ` Logan Gunthorpe
  2021-09-29 23:36         ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 23:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
>>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>>>> Hi,
>>>>
>>>> This patchset continues my work to add userspace P2PDMA access using
>>>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>>>> patches in this series, but the early P2PDMA cleanup and map_sg error
>>>> changes from that series have been merged into v5.15-rc1. To address
>>>> concerns that that series did not add any new functionality, I've added
> >>>> back the userspace functionality from the original RFC[2] (but improved
>>>> based on the original feedback).
>>>
>>> I really think this is the best series yet, it really looks nice
>>> overall. I know the sg flag was a bit of a debate at the start, but it
>>> serves an undeniable purpose and the resulting standard DMA APIs 'just
>>> working' is really clean.
>>
>> Actually, so far, nobody has said anything negative about using the SG flag.
>>
>>> There is more possible here, we could also pass the new GUP flag in the
>>> ib_umem code..
>>
>> Yes, that would be very useful.
> 
> You might actually prefer to do that rather than the bio changes to get the
> infrastructure merged, as it seems less "core"

I'm a little bit more concerned about my patch set growing too large.
It's already at 20 patches and I think I'll need to add a couple more
based on the feedback you've already provided. So I'm leaning toward
pushing more functionality as future work.

Logan


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-29 23:27         ` Logan Gunthorpe
@ 2021-09-29 23:35           ` Jason Gunthorpe
  2021-09-29 23:49             ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:35 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote:

> > finish_fault() should set the pte_devmap - eg by passing the
> > PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
> > otherwise signaling do_set_pte() that it should set those PTE bits
> > when it creates the entry.
> > 
> > (or there should be a vmf_* helper for this special case, but using
> > the vmf->page seems righter to me)
> 
> I'm not opposed to this. Though I'm not sure what's best here.
> 
> >> If we don't set pte_devmap(), then every single page that GUP
> >> processes needs to check if it's a ZONE_DEVICE page and also if it's
> >> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
> >> requirements of FOLL_PCI_P2PDMA.
> > 
> > Definitely not suggesting not to set pte_devmap(), only that
> > VM_MIXEDMAP should not be set on VMAs that only contain struct
> > pages. That is an abuse of what it is intended for.
> > 
> > At the very least there should be a big comment above the usage
> > explaining that this is just working around a limitation in
> > finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.
> 
> Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
> and the intention is not clear. I got the impression that mm people
> wanted those interfaces used for users of pte_devmap().

I thought VM_MIXEDMAP was quite clear:

#define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */

This VMA does not include PFN pages, so it should not be tagged
VM_MIXEDMAP.

Aside from enabling the special vmf_ API, it only controls some
special behavior in vm_normal_page:

 * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
 * page" backing, however the difference is that _all_ pages with a struct
 * page (that is, those where pfn_valid is true) are refcounted and considered
 * normal pages by the VM. The disadvantage is that pages are refcounted
 * (which can be slower and simply not an option for some PFNMAP users). The
 * advantage is that we don't have to follow the strict linearity rule of
 * PFNMAP mappings in order to support COWable mappings.

Which again does not describe this case.

> device-dax uses these interfaces and as far as I can see it also only
> contains struct pages (or at least dev_dax_huge_fault() calls
> pfn_to_page() on every page when VM_FAULT_NOPAGE happens).

hacky hacky :)

I think DAX probably did it that way for the same reason you are
doing it that way - no other choice without changing something

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-29 23:28       ` Logan Gunthorpe
@ 2021-09-29 23:36         ` Jason Gunthorpe
  2021-09-29 23:52           ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:36 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
> >>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
> >>>> Hi,
> >>>>
> >>>> This patchset continues my work to add userspace P2PDMA access using
> >>>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
> >>>> patches in this series, but the early P2PDMA cleanup and map_sg error
> >>>> changes from that series have been merged into v5.15-rc1. To address
> >>>> concerns that that series did not add any new functionality, I've added
> >>>> back the userspace functionality from the original RFC[2] (but improved
> >>>> based on the original feedback).
> >>>
> >>> I really think this is the best series yet, it really looks nice
> >>> overall. I know the sg flag was a bit of a debate at the start, but it
> >>> serves an undeniable purpose and the resulting standard DMA APIs 'just
> >>> working' is really clean.
> >>
> >> Actually, so far, nobody has said anything negative about using the SG flag.
> >>
> >>> There is more possible here, we could also pass the new GUP flag in the
> >>> ib_umem code..
> >>
> >> Yes, that would be very useful.
> > 
> > You might actually prefer to do that than the bio changes to get the
> > infrastructure merged as it seems less "core"
> 
> I'm a little bit more concerned about my patch set growing too large.
> It's already at 20 patches and I think I'll need to add a couple more
> based on the feedback you've already provided. So I'm leaning toward
> pushing more functionality as future work.

I mean you could postpone the three block related patches and use a
single ib_umem patch instead as the consumer.

Jason


* Re: [PATCH v3 4/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations
  2021-09-29 23:00         ` Logan Gunthorpe
@ 2021-09-29 23:40           ` Jason Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 23:40 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 05:00:43PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2021-09-29 4:46 p.m., Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:30:42PM -0600, Logan Gunthorpe wrote:
> >> On 2021-09-28 4:05 p.m., Jason Gunthorpe wrote:
> >> No, that's not a correct reading of the code. Every time there is a new
> >> pagemap, this code calculates the mapping type and bus offset. If a page
> >> comes along with a different pgmap, it recalculates. This just
> >> reduces the overhead so that the calculation is done only every time a
> >> page with a different pgmap comes along and not doing it for every
> >> single page.
> > 
> > Each 'struct scatterlist *sg' refers to a range of contiguous pfns
> > starting at page_to_pfn(sg_page()) and going for approx sg->length/PAGE_SIZE
> > pfns long.
> > 
> 
> Ugh, right. A bit contrived for consecutive pages to have different
> pgmaps and still be next to each other in a DMA transaction. But I guess
> it is technically possible and should be protected against.

I worry it is something a hostile userspace could cook up using mmap
and use to cause some kind of kernel integrity problem.

> > @@ -470,7 +470,8 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
> >  
> >                 /* Merge contiguous pages into the last SG */
> >                 prv_len = sgt_append->prv->length;
> > -               while (n_pages && page_to_pfn(pages[0]) == paddr) {
> > +               while (n_pages && page_to_pfn(pages[0]) == paddr &&
> > +                      sg_page(sgt_append->prv)->pgmap == pages[0]->pgmap) {
> 
> I don't think it's correct to use pgmap without first checking if it is
> a zone device page. But your point is taken. I'll try to address this.

Yes

Jason


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-29 23:35           ` Jason Gunthorpe
@ 2021-09-29 23:49             ` Logan Gunthorpe
  2021-09-30  0:36               ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 23:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-29 5:35 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:27:22PM -0600, Logan Gunthorpe wrote:
> 
>>> finish_fault() should set the pte_devmap - eg by passing the
>>> PFN_DEV|PFN_MAP somehow through the vma->vm_page_prot to mk_pte() or
>>> otherwise signaling do_set_pte() that it should set those PTE bits
>>> when it creates the entry.
>>>
>>> (or there should be a vmf_* helper for this special case, but using
>>> the vmf->page seems righter to me)
>>
>> I'm not opposed to this. Though I'm not sure what's best here.
>>
>>>> If we don't set pte_devmap(), then every single page that GUP
>>>> processes needs to check if it's a ZONE_DEVICE page and also if it's
>>>> a P2PDMA page (thus dereferencing pgmap) in order to satisfy the
>>>> requirements of FOLL_PCI_P2PDMA.
>>>
>>> Definitely not suggesting not to set pte_devmap(), only that
>>> VM_MIXEDMAP should not be set on VMAs that only contain struct
>>> pages. That is an abuse of what it is intended for.
>>>
>>> At the very least there should be a big comment above the usage
>>> explaining that this is just working around a limitation in
>>> finish_fault() where it cannot set the PFN_DEV|PFN_MAP bits today.
>>
>> Is it? Documentation on vmf_insert_mixed() and VM_MIXEDMAP is not good
>> and the intention is not clear. I got the impression that mm people
>> wanted those interfaces used for users of pte_devmap().
> 
> I thought VM_MIXEDMAP was quite clear:
> 
> #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
> 
> This VMA does not include PFN pages, so it should not be tagged
> VM_MIXEDMAP.
> 
> Aside from enabling the special vmf_ API, it only controls some
> special behavior in vm_normal_page:
> 
>  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>  * page" backing, however the difference is that _all_ pages with a struct
>  * page (that is, those where pfn_valid is true) are refcounted and considered
>  * normal pages by the VM. The disadvantage is that pages are refcounted
>  * (which can be slower and simply not an option for some PFNMAP users). The
>  * advantage is that we don't have to follow the strict linearity rule of
>  * PFNMAP mappings in order to support COWable mappings.
> 
> Which again does not describe this case.

Some of this seems out of date. Pretty sure the pages are not refcounted
with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
to use VM_MIXEDMAP mappings.

>> device-dax uses these interfaces and as far as I can see it also only
>> contains struct pages (or at least dev_dax_huge_fault() calls
>> pfn_to_page() on every page when VM_FAULT_NOPAGE happens).
> 
> hacky hacky :)
> 
> I think DAX probably did it that way for the same reason you are
> doing it that way - no other choice without changing something

Sure, but if you look at other vmf_insert_mixed() users (of which there
are few) you see similar patterns. Seems more like it was documented with
one thing in mind but then used in a completely different manner. Which
is why I suggested the documentation was not so good.

Logan


* Re: [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices
  2021-09-29 23:36         ` Jason Gunthorpe
@ 2021-09-29 23:52           ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-29 23:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-29 5:36 p.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 05:28:38PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-09-29 5:21 p.m., Jason Gunthorpe wrote:
>>> On Wed, Sep 29, 2021 at 03:50:02PM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>> On 2021-09-28 2:02 p.m., Jason Gunthorpe wrote:
>>>>> On Thu, Sep 16, 2021 at 05:40:40PM -0600, Logan Gunthorpe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patchset continues my work to add userspace P2PDMA access using
>>>>>> O_DIRECT NVMe devices. My last posting[1] just included the first 13
>>>>>> patches in this series, but the early P2PDMA cleanup and map_sg error
>>>>>> changes from that series have been merged into v5.15-rc1. To address
>>>>>> concerns that that series did not add any new functionality, I've added
> >>>>>> back the userspace functionality from the original RFC[2] (but improved
>>>>>> based on the original feedback).
>>>>>
>>>>> I really think this is the best series yet, it really looks nice
>>>>> overall. I know the sg flag was a bit of a debate at the start, but it
>>>>> serves an undeniable purpose and the resulting standard DMA APIs 'just
>>>>> working' is really clean.
>>>>
>>>> Actually, so far, nobody has said anything negative about using the SG flag.
>>>>
>>>>> There is more possible here, we could also pass the new GUP flag in the
>>>>> ib_umem code..
>>>>
>>>> Yes, that would be very useful.
>>>
>>> You might actually prefer to do that than the bio changes to get the
>>> infrastructure merged as it seems less "core"
>>
>> I'm a little bit more concerned about my patch set growing too large.
>> It's already at 20 patches and I think I'll need to add a couple more
>> based on the feedback you've already provided. So I'm leaning toward
>> pushing more functionality as future work.
> 
> I mean you could postpone the three block related patches and use a
> single ib_umem patch instead as the consumer.

I think that's not a very compelling use case given the only provider of
these VMAs is an NVMe block device. My patch set enables a real world
use (copying data between NVMe devices P2P through the CMB with O_DIRECT).

Being able to read or write a CMB with RDMA and only RDMA is not very
compelling.

Logan


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-29 23:49             ` Logan Gunthorpe
@ 2021-09-30  0:36               ` Jason Gunthorpe
  2021-10-01 13:48                 ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-09-30  0:36 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 05:49:36PM -0600, Logan Gunthorpe wrote:

> Some of this seems out of date. Pretty sure the pages are not refcounted
> with vmf_insert_mixed() and vmf_insert_mixed() is currently the only way
> to use VM_MIXEDMAP mappings.

Hum.

vmf_insert_mixed() boils down to insert_pfn() which always sets the
special bit, so vm_normal_page() returns NULL and thus the pages are
not freed during zap.

So, if the pages are always special and not refcounted all the docs
seem really out of date - or rather they describe the situation
without the special bit, I think.

Why would DAX want to do this in the first place?? This means the
address space zap is much more important than just speeding up
destruction, it is essential for correctness since the PTEs are not
holding refcounts naturally...

Sigh.

Jason


* Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
  2021-09-28 18:32   ` Jason Gunthorpe
@ 2021-09-30  4:47   ` Chaitanya Kulkarni
  2021-09-30 16:49     ` Logan Gunthorpe
  2021-09-30  4:57   ` Chaitanya Kulkarni
  2 siblings, 1 reply; 87+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-30  4:47 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

Logan,

> +/*
> + * bit 2 is the third free bit in the page_link on 64bit systems which
> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
> + * bus address when doing P2PDMA.
> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
> + */
> +
> +#ifdef CONFIG_PCI_P2PDMA
> +#define SG_DMA_PCI_P2PDMA      0x04UL
> +#else
> +#define SG_DMA_PCI_P2PDMA      0x00UL
> +#endif
> +
> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
> +

You are doing two things in one patch :-
1. Introducing a new macro to replace the current macros.
2. Adding a new member to those macros.

shouldn't this be split into two patches where you introduce a
macro SG_PAGE_LINK_MASK (SG_CHAIN | SG_END) in a prep patch and
update SG_PAGE_LINK_MASK with SG_DMA_PCI_P2PDMA and the related
code?

OR

Is there a reason why it is not split ?

-ck


* Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
  2021-09-28 18:32   ` Jason Gunthorpe
  2021-09-30  4:47   ` Chaitanya Kulkarni
@ 2021-09-30  4:57   ` Chaitanya Kulkarni
  2 siblings, 0 replies; 87+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-30  4:57 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

> +/**
> + * sg_dma_unmark_pci_p2pdma - Unmark the scatterlist entry for PCI p2pdma
> + * @sg:                 SG entry
> + *
> + * Description:
> + *   Clears the PCI P2PDMA mark
> + **/
nit:- Probably want to add '.' above.
> +static inline void sg_dma_unmark_pci_p2pdma(struct scatterlist *sg)
> +{
> +       sg->page_link &= ~SG_DMA_PCI_P2PDMA;
> +}
> +
>   /**
>    * sg_phys - Return physical address of an sg entry
>    * @sg:             SG entry
> --
> 2.30.2
> 

either ways with or without split, looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>



* Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  2021-09-16 23:40 ` [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
@ 2021-09-30  5:06   ` Chaitanya Kulkarni
  2021-09-30 16:51     ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-30  5:06 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

Logan,

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7efb31b87f37..916750a54f60 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>                  blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
> 
>          blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
> -       if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
> +       if (ctrl->ops->supports_pci_p2pdma &&
> +           ctrl->ops->supports_pci_p2pdma(ctrl))
>                  blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
> 
>          ns->ctrl = ctrl;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9871c0c9374c..fb9bfc52a6d7 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
>          unsigned int flags;
>   #define NVME_F_FABRICS                 (1 << 0)
>   #define NVME_F_METADATA_SUPPORTED      (1 << 1)
> -#define NVME_F_PCI_P2PDMA              (1 << 2)
>          int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
>          int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
>          int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
> @@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
>          void (*submit_async_event)(struct nvme_ctrl *ctrl);
>          void (*delete_ctrl)(struct nvme_ctrl *ctrl);
>          int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
> +       bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
>   };
> 

Is this new op only needed for the PCIe transport? Or do you have
follow-up patches that use this op for the other transports?

If it is only needed for PCIe then we need to find a way to
not add this somehow...



* Re: [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  2021-09-30  4:47   ` Chaitanya Kulkarni
@ 2021-09-30 16:49     ` Logan Gunthorpe
  0 siblings, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-30 16:49 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-29 10:47 p.m., Chaitanya Kulkarni wrote:
> Logan,
> 
>> +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a PCI
>> + * bus address when doing P2PDMA.
>> + * Note: CONFIG_PCI_P2PDMA depends on CONFIG_64BIT because of this.
>> + */
>> +
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_PCI_P2PDMA      0x04UL
>> +#else
>> +#define SG_DMA_PCI_P2PDMA      0x00UL
>> +#endif
>> +
>> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_PCI_P2PDMA)
>> +
> 
> You are doing two things in one patch :-
> 1. Introducing a new macro to replace the current macros.
> 2. Adding a new member to those macros.
> 
> shouldn't this be split into two patches where you introduce a
> macro SG_PAGE_LINK_MASK (SG_CHAIN | SG_END) in a prep patch and
> update SG_PAGE_LINK_MASK with SG_DMA_PCI_P2PDMA and the related
> code?
> 

Ok, will split. I'll also add the static inline cleanup Jason suggested
in the first patch.

Logan


* Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  2021-09-30  5:06   ` Chaitanya Kulkarni
@ 2021-09-30 16:51     ` Logan Gunthorpe
  2021-09-30 17:19       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-09-30 16:51 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni



On 2021-09-29 11:06 p.m., Chaitanya Kulkarni wrote:
> Logan,
> 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 7efb31b87f37..916750a54f60 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3771,7 +3771,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
>>                  blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
>>
>>          blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
>> -       if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
>> +       if (ctrl->ops->supports_pci_p2pdma &&
>> +           ctrl->ops->supports_pci_p2pdma(ctrl))
>>                  blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
>>
>>          ns->ctrl = ctrl;
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 9871c0c9374c..fb9bfc52a6d7 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -477,7 +477,6 @@ struct nvme_ctrl_ops {
>>          unsigned int flags;
>>   #define NVME_F_FABRICS                 (1 << 0)
>>   #define NVME_F_METADATA_SUPPORTED      (1 << 1)
>> -#define NVME_F_PCI_P2PDMA              (1 << 2)
>>          int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
>>          int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
>>          int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
>> @@ -485,6 +484,7 @@ struct nvme_ctrl_ops {
>>          void (*submit_async_event)(struct nvme_ctrl *ctrl);
>>          void (*delete_ctrl)(struct nvme_ctrl *ctrl);
>>          int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
>> +       bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
>>   };
>>
> 
> Is this new op only needed for the PCIe transport? Or do you have
> follow-up patches that use this op for the other transports?

No, I don't think this will make sense for transports that are not based
on PCI devices.

> If it is only needed for PCIe then we need to find a way to
> not add this somehow...

I don't see how we can do that. The core code needs to know whether the
transport supports this and must have an operation to query it.

Logan



* Re: [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  2021-09-30 16:51     ` Logan Gunthorpe
@ 2021-09-30 17:19       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 87+ messages in thread
From: Chaitanya Kulkarni @ 2021-09-30 17:19 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni


>>
>> Is this new op only needed for the PCIe transport? Or do you have
>> follow-up patches that use this op for the other transports?
> 
> No, I don't think this will make sense for transports that are not based
> on PCI devices.
> 
>> If it is only needed for PCIe then we need to find a way to
>> not add this somehow...
> 
> I don't see how we can do that. The core code needs to know whether the
> transport supports this and must have an operation to query it.
> 

Okay.

> Logan
> 


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-09-30  0:36               ` Jason Gunthorpe
@ 2021-10-01 13:48                 ` Jason Gunthorpe
  2021-10-01 17:01                   ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-01 13:48 UTC (permalink / raw)
  To: Logan Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni

On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:

> Why would DAX want to do this in the first place?? This means the
> address space zap is much more important than just speeding up
> destruction, it is essential for correctness since the PTEs are not
> holding refcounts naturally...

It is not really for this series to fix, but I think the whole thing
is probably racy once you start allowing pte_special pages to be
accessed by GUP.

If we look at unmapping the PTE relative to GUP fast the important
sequence is how the TLB flushing doesn't decrement the page refcount
until after it knows any concurrent GUP fast is completed. This is
arch specific, eg it could be done async through a call_rcu handler.

This ensures that pages can't cross back into the free pool and be
reallocated until we know for certain that nobody is walking the PTEs
and could potentially take an additional reference on it. The scheme
cannot rely on the page refcount being 0 because once it goes into the
free pool it could be immediately reallocated back to a non-zero
refcount.

A DAX user that simply does an address space invalidation doesn't
sequence itself with any of this mechanism. So we can race with a
thread doing GUP fast and another thread re-cycling the page into
another use - creating a leakage of the page from one security context
to another.

This seems to be made worse for the pgmap stuff due to the wonky
refcount usage - at least if the refcount had dropped to zero gup fast
would be blocked for a time, but even that doesn't happen.

In short, I think using pg special for anything that can be returned
by gup fast (and maybe even gup!) is racy/wrong. We must have the
normal refcount mechanism work for correctness of the recycling flow.

I don't know why DAX did this, I think we should be talking about
undoing all of it, not just the wonky refcounting Alistair and Felix
are working on, but also the use of MIXEDMAP and pte special for
struct page backed memory.

Jason


* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 13:48                 ` Jason Gunthorpe
@ 2021-10-01 17:01                   ` Logan Gunthorpe
  2021-10-01 17:45                     ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-10-01 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni




On 2021-10-01 7:48 a.m., Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:
> 
>> Why would DAX want to do this in the first place?? This means the
>> address space zap is much more important than just speeding up
>> destruction, it is essential for correctness since the PTEs are not
>> holding refcounts naturally...
> 
> It is not really for this series to fix, but I think the whole thing
> is probably racy once you start allowing pte_special pages to be
> accessed by GUP.
> 
> If we look at unmapping the PTE relative to GUP fast the important
> sequence is how the TLB flushing doesn't decrement the page refcount
> until after it knows any concurrent GUP fast is completed. This is
> arch specific, eg it could be done async through a call_rcu handler.
> 
> This ensures that pages can't cross back into the free pool and be
> reallocated until we know for certain that nobody is walking the PTEs
> and could potentially take an additional reference on it. The scheme
> cannot rely on the page refcount being 0 because oce it goes into the
> free pool it could be immeidately reallocated back to a non-zero
> refcount.
> 
> A DAX user that simply does an address space invalidation doesn't
> sequence itself with any of this mechanism. So we can race with a
> thread doing GUP fast and another thread re-cycling the page into
> another use - creating a leakage of the page from one security context
> to another.
> 
> This seems to be made worse for the pgmap stuff due to the wonky
> refcount usage - at least if the refcount had dropped to zero gup fast
> would be blocked for a time, but even that doesn't happen.
> 
> In short, I think using pg special for anything that can be returned
> by gup fast (and maybe even gup!) is racy/wrong. We must have the
> normal refcount mechanism work for correctness of the recycling flow.

I'm not quite following all of this. I'm not entirely sure how fs/dax
works in this regard, but for device-dax (and similarly p2pdma) it
doesn't seem as bad as you say.

In device-dax, the refcount is only used to prevent the device, and
therefore the pages, from going away on device unbind. Pages cannot be
recycled, as you say, as they are mapped linearly within the device. The
address space invalidation is done only when the device is unbound.
Before the invalidation, an active flag is cleared to ensure no new
mappings can be created while the unmap is proceeding.
unmap_mapping_range() should sequence itself with the TLB flush and
GUP-fast using the same mechanism it does for regular pages. As far as I
can see, by the time unmap_mapping_range() returns, we should be
confident that there are no pages left in any mapping (seeing no new
pages could be added since before the call). Then before finishing the
unbind, device-dax decrements the refcount of all pages and then waits
for the refcount of all pages to go to zero. Thus, any pages that
successfully were got with GUP, during or before unmap_mapping_range
should hold a reference and once all those references are returned,
unbind can finish.

P2PDMA follows this pattern, except pages are not mapped linearly and
are returned to the genalloc when their refcount falls to 1. This only
happens after a VMA is closed which should imply the PTEs have already
been unlinked from the pages. And the same situation occurs on unbind
with a flag preventing new mappings from being created before
unmap_mapping_range(), etc.

Not to say that all this couldn't use a big conceptual cleanup. A
similar question exists with the single find_special_page() user
(xen/gntdev) and it's definitely not clear what the differences are
between the find_special_page() and vmf_insert_mixed() techniques and
when one should be used over the other. Or could they both be merged to
use the same technique?

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 17:01                   ` Logan Gunthorpe
@ 2021-10-01 17:45                     ` Jason Gunthorpe
  2021-10-01 20:13                       ` Logan Gunthorpe
  2021-10-04  6:58                       ` Christian König
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-01 17:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni

On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:

> In device-dax, the refcount is only used to prevent the device, and
> therefore the pages, from going away on device unbind. Pages cannot be
> recycled, as you say, as they are mapped linearly within the device. The
> address space invalidation is done only when the device is unbound.

By address space invalidation I mean invalidation of the VMA that is
pointing to those pages.

device-dax may not have an issue with use-after-VMA-invalidation by
its very nature since every PFN always points to the same
thing. fsdax and this p2p stuff are different though.

> Before the invalidation, an active flag is cleared to ensure no new
> mappings can be created while the unmap is proceeding.
> unmap_mapping_range() should sequence itself with the TLB flush and

AFAIK unmap_mapping_range() kicks off the TLB flush and then
returns. It doesn't always wait for the flush to fully finish. Ie some
cases use RCU to lock the page table against GUP fast and so the
put_page() doesn't happen until the call_rcu completes - after a grace
period. The unmap_mapping_range() does not wait for grace periods.

This is why for normal memory the put_page is done after the TLB flush
completes, not when unmap_mapping_range() finishes. This ensures that
before the refcount reaches 0 no concurrent GUP fast can still observe
the old PTEs.

> GUP-fast using the same mechanism it does for regular pages. As far as I
> can see, by the time unmap_mapping_range() returns, we should be
> confident that there are no pages left in any mapping (seeing no new
> pages could be added since before the call). 

When viewed under the page table locks this is true, but the 'fast'
walkers like gup_fast and hmm_range_fault can continue to be working
on old data in the ptes because they don't take the page table locks.

They interact with unmap_mapping_range() via the IPI/rcu (gup fast) or
mmu notifier sequence count (hmm_range_fault)

> P2PDMA follows this pattern, except pages are not mapped linearly and
> are returned to the genalloc when their refcount falls to 1. This only
> happens after a VMA is closed which should imply the PTEs have already
> been unlinked from the pages. 

And here is the problem, since the genalloc is being used we now care
that a page should not continue to be accessed by userspace after it
has been placed back into the genalloc. I suppose fsdax has the same
basic issue too.

> Not to say that all this couldn't use a big conceptual cleanup. A
> similar question exists with the single find_special_page() user
> (xen/gntdev) and it's definitely not clear what the differences are
> between the find_special_page() and vmf_insert_mixed() techniques and
> when one should be used over the other. Or could they both be merged to
> use the same technique?

Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate
control over a PTE entry itself to the hypervisor.

		/*
		 * gntdev takes the address of the PTE in find_grant_ptes() and
		 * passes it to the hypervisor in gntdev_map_grant_pages(). The
		 * purpose of the notifier is to prevent the hypervisor pointer
		 * to the PTE from going stale.
		 *
		 * Since this vma's mappings can't be touched without the
		 * mmap_lock, and we are holding it now, there is no need for
		 * the notifier_range locking pattern.

I vaguely recall it stuffs in a normal page then has the hypervisor
overwrite the PTE. When it comes time to free the PTE it recovers the
normal page via the 'find_special_page' hack and frees it. Somehow the
hypervisor is also using the normal page for something.

It is all very strange and one shouldn't think about it :|

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 17:45                     ` Jason Gunthorpe
@ 2021-10-01 20:13                       ` Logan Gunthorpe
  2021-10-01 22:14                         ` Jason Gunthorpe
  2021-10-04  6:58                       ` Christian König
  1 sibling, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-10-01 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni



On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
>> Before the invalidation, an active flag is cleared to ensure no new
>> mappings can be created while the unmap is proceeding.
>> unmap_mapping_range() should sequence itself with the TLB flush and
> 
> AFAIK unmap_mapping_range() kicks off the TLB flush and then
> returns. It doesn't always wait for the flush to fully finish. Ie some
> cases use RCU to lock the page table against GUP fast and so the
> put_page() doesn't happen until the call_rcu completes - after a grace
> period. The unmap_mapping_range() does not wait for grace periods.

Admittedly, the tlb flush code isn't the easiest code to understand.
But, yes it seems at least on some arches the pages are freed by
call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
call after calling unmap_mapping_range()? Certainly after a
synchronize_rcu(), the TLB has been flushed and it is safe to free those
pages.

>> P2PDMA follows this pattern, except pages are not mapped linearly and
>> are returned to the genalloc when their refcount falls to 1. This only
>> happens after a VMA is closed which should imply the PTEs have already
>> been unlinked from the pages. 
> 
> And here is the problem, since the genalloc is being used we now care
> that a page should not continue to be accessed by userspace after it
> has been placed back into the genalloc. I suppose fsdax has the same
> basic issue too.

Ok, similar question. Then if we call synchronize_rcu() in vm_close(),
before the put_page() calls which return the pages to the genalloc,
would that not guarantee the TLBs have been appropriately flushed?


>> Not to say that all this couldn't use a big conceptual cleanup. A
>> similar question exists with the single find_special_page() user
>> (xen/gntdev) and it's definitely not clear what the differences are
>> between the find_special_page() and vmf_insert_mixed() techniques and
>> when one should be used over the other. Or could they both be merged to
>> use the same technique?
> 
> Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate
> control over a PTE entry itself to the hypervisor.
> 
> 		/*
> 		 * gntdev takes the address of the PTE in find_grant_ptes() and
> 		 * passes it to the hypervisor in gntdev_map_grant_pages(). The
> 		 * purpose of the notifier is to prevent the hypervisor pointer
> 		 * to the PTE from going stale.
> 		 *
> 		 * Since this vma's mappings can't be touched without the
> 		 * mmap_lock, and we are holding it now, there is no need for
> 		 * the notifier_range locking pattern.
> 
> I vaguely recall it stuffs in a normal page then has the hypervisor
> overwrite the PTE. When it comes time to free the PTE it recovers the
> normal page via the 'find_special_page' hack and frees it. Somehow the
> hypervisor is also using the normal page for something.
> 
> It is all very strange and one shouldn't think about it :|

Found this from an old commit message which seems to be a better
explanation, though I still don't fully understand it:

   In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for
   example) must do an MFN to PFN (M2P) lookup before it can get the
   page.  For foreign pages (those owned by another guest) the M2P
   lookup returns the PFN as seen by the foreign guest (which would be
   completely the wrong page for the local guest).

   This cannot be fixed by improving the M2P lookup since one MFN may be
   mapped onto two or more pages so getting the right page is impossible
   given just the MFN.

Yes, all very strange.

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 20:13                       ` Logan Gunthorpe
@ 2021-10-01 22:14                         ` Jason Gunthorpe
  2021-10-01 22:22                           ` Logan Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-01 22:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni

On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
> >> Before the invalidation, an active flag is cleared to ensure no new
> >> mappings can be created while the unmap is proceeding.
> >> unmap_mapping_range() should sequence itself with the TLB flush and
> > 
> > AFAIK unmap_mapping_range() kicks off the TLB flush and then
> > returns. It doesn't always wait for the flush to fully finish. Ie some
> > cases use RCU to lock the page table against GUP fast and so the
> > put_page() doesn't happen until the call_rcu completes - after a grace
> > period. The unmap_mapping_range() does not wait for grace periods.
> 
> Admittedly, the tlb flush code isn't the easiest code to understand.
> But, yes it seems at least on some arches the pages are freed by
> call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
> call after calling unmap_mapping_range()? Certainly after a
> synchronize_rcu(), the TLB has been flushed and it is safe to free those
> pages.

It would close this issue, however synchronize_rcu() is very slow
(think > 1 second) in some cases and thus cannot be inserted here.

I'm also not completely sure that rcu is the only case, I don't know
how every arch handles its gather structure.. I have a feeling the
general intention was for this to be asynchronous

My preferences are to either remove devmap from gup_fast, or fix it to
not use special pages - the latter being obviously better.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 22:14                         ` Jason Gunthorpe
@ 2021-10-01 22:22                           ` Logan Gunthorpe
  2021-10-01 22:46                             ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Logan Gunthorpe @ 2021-10-01 22:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni




On 2021-10-01 4:14 p.m., Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote:
>>>> Before the invalidation, an active flag is cleared to ensure no new
>>>> mappings can be created while the unmap is proceeding.
>>>> unmap_mapping_range() should sequence itself with the TLB flush and
>>>
>>> AFAIK unmap_mapping_range() kicks off the TLB flush and then
>>> returns. It doesn't always wait for the flush to fully finish. Ie some
>>> cases use RCU to lock the page table against GUP fast and so the
>>> put_page() doesn't happen until the call_rcu completes - after a grace
>>> period. The unmap_mapping_range() does not wait for grace periods.
>>
>> Admittedly, the tlb flush code isn't the easiest code to understand.
>> But, yes it seems at least on some arches the pages are freed by
>> call_rcu(). But can't this be fixed easily by adding a synchronize_rcu()
>> call after calling unmap_mapping_range()? Certainly after a
>> synchronize_rcu(), the TLB has been flushed and it is safe to free those
>> pages.
> 
> It would close this issue, however synchronize_rcu() is very slow
> (think > 1 second) in some cases and thus cannot be inserted here.

It shouldn't be *that* slow, at least not the vast majority of the
time... it seems a bit unreasonable that a CPU wouldn't schedule for
more than a second. But these aren't fast paths and synchronize_rcu()
already gets called in the unbind path for p2pdma a couple of times. I'm
sure it would also be fine to slow down the vma_close() path as well.

> I'm also not completely sure that rcu is the only case, I don't know
> how every arch handles its gather structure.. I have a feeling the
> general intention was for this to be asynchronous

Yeah, this is not clear to me either.

> My preferences are to either remove devmap from gup_fast, or fix it to
> not use special pages - the latter being obviously better.

Yeah, I rather expect DAX users want the optimization provided by
gup_fast. I don't think P2PDMA users would be happy about being stuck
with slow gup either.

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 22:22                           ` Logan Gunthorpe
@ 2021-10-01 22:46                             ` Jason Gunthorpe
  2021-10-01 23:27                               ` John Hubbard
  2021-10-01 23:34                               ` Logan Gunthorpe
  0 siblings, 2 replies; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-01 22:46 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni

On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:

> > It would close this issue, however synchronize_rcu() is very slow
> > (think > 1 second) in some cases and thus cannot be inserted here.
> 
> It shouldn't be *that* slow, at least not the vast majority of the
> time... it seems a bit unreasonable that a CPU wouldn't schedule for
> more than a second. 

I've seen bug reports on exactly this, it is well known. Loaded
big multi-cpu systems have high delays here, for whatever reason.

> But these aren't fast paths and synchronize_rcu() already gets
> called in the unbind path for p2pdma a couple of times. I'm sure it
> would also be fine to slow down the vma_close() path as well.

vma_close is done in a loop destroying VMAs and if each synchronize_rcu()
costs > 1s it can take forever to close a process. We had to kill a
similar use of synchronize_rcu in RDMA because users were complaining
of > 40s process exit times.

The driver unload path is fine to be slow, and is probably done on an
unloaded system where synchronize_rcu is not so bad

Anyway, it is not really something for this series to fix, just
something we should all be aware of and probably ought to get fixed
before we do much more with ZONE_DEVICE pages

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 22:46                             ` Jason Gunthorpe
@ 2021-10-01 23:27                               ` John Hubbard
  2021-10-01 23:34                               ` Logan Gunthorpe
  1 sibling, 0 replies; 87+ messages in thread
From: John Hubbard @ 2021-10-01 23:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Jakowski Andrzej, Minturn Dave B,
	Jason Ekstrand, Dave Hansen, Xiong Jianxin, Bjorn Helgaas,
	Ira Weiny, Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

On 10/1/21 15:46, Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:
> 
>>> It would close this issue, however synchronize_rcu() is very slow
>>> (think > 1 second) in some cases and thus cannot be inserted here.
>>
>> It shouldn't be *that* slow, at least not the vast majority of the
>> time... it seems a bit unreasonable that a CPU wouldn't schedule for
>> more than a second.
> 
> I've seen bug reports on exactly this, it is well known. Loaded
> big multi-cpu systems have high delays here, for whatever reason.
> 

So have I. One reason is that synchronize_rcu() doesn't merely wait
for a context switch on each CPU--it also waits for callbacks (such as
those set up by call_rcu(), if I understand correctly) to run.

These can really add up to something quite substantial. In fact, I don't
think there is an upper limit on the running times, anywhere.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 22:46                             ` Jason Gunthorpe
  2021-10-01 23:27                               ` John Hubbard
@ 2021-10-01 23:34                               ` Logan Gunthorpe
  1 sibling, 0 replies; 87+ messages in thread
From: Logan Gunthorpe @ 2021-10-01 23:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christian König, John Hubbard,
	Don Dutile, Matthew Wilcox, Daniel Vetter, Jakowski Andrzej,
	Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
	Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
	Chaitanya Kulkarni




On 2021-10-01 4:46 p.m., Jason Gunthorpe wrote:
> On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote:
> 
>>> It would close this issue, however synchronize_rcu() is very slow
>>> (think > 1 second) in some cases and thus cannot be inserted here.
>>
>> It shouldn't be *that* slow, at least not the vast majority of the
>> time... it seems a bit unreasonable that a CPU wouldn't schedule for
>> more than a second. 
> 
> I've seen bug reports on exactly this, it is well known. Loaded
> big multi-cpu systems have high delays here, for whatever reason.
> 
>> But these aren't fast paths and synchronize_rcu() already gets
>> called in the unbind path for p2pdma a couple of times. I'm sure it
>> would also be fine to slow down the vma_close() path as well.
> 
> vma_close is done in a loop destroying VMAs and if each synchronize_rcu()
> costs > 1s it can take forever to close a process. We had to kill a
> similar use of synchronize_rcu in RDMA because users were complaining
> of > 40s process exit times.

Ah, fair. This adds a bit of complexity, but we could do a call_rcu() in
vma_close to do the page frees.
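A minimal kernel-style sketch of that idea (pseudocode only; p2pdma_vma_work and both function names are invented for illustration, this is not code from the series) would embed an rcu_head per deferred free so vma_close() never blocks:

```
/* hypothetical sketch: defer the genalloc return past a grace period */
struct p2pdma_vma_work {
	struct rcu_head rcu;
	struct page *page;
};

static void p2pdma_page_free_rcu(struct rcu_head *rcu)
{
	struct p2pdma_vma_work *w =
		container_of(rcu, struct p2pdma_vma_work, rcu);

	put_page(w->page);	/* refcount -> 1 returns it to the genalloc */
	kfree(w);
}

static void p2pdma_vma_close_one(struct page *page)
{
	struct p2pdma_vma_work *w = kmalloc(sizeof(*w), GFP_KERNEL);

	if (!w)
		return;		/* error handling elided in this sketch */
	w->page = page;
	call_rcu(&w->rcu, p2pdma_page_free_rcu);	/* no blocking here */
}
```

The allocation per page is the extra complexity; the win is that vma_close() pays no grace-period latency.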

Logan

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-01 17:45                     ` Jason Gunthorpe
  2021-10-01 20:13                       ` Logan Gunthorpe
@ 2021-10-04  6:58                       ` Christian König
  2021-10-04 13:11                         ` Jason Gunthorpe
  1 sibling, 1 reply; 87+ messages in thread
From: Christian König @ 2021-10-04  6:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Alistair Popple, Felix Kuehling, Christoph Hellwig, Dan Williams,
	linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

I'm not following this discussion too closely, but try to look into it
from time to time.

Am 01.10.21 um 19:45 schrieb Jason Gunthorpe:
> On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:
>
>> In device-dax, the refcount is only used to prevent the device, and
>> therefore the pages, from going away on device unbind. Pages cannot be
>> recycled, as you say, as they are mapped linearly within the device. The
>> address space invalidation is done only when the device is unbound.
> By address space invalidation I mean invalidation of the VMA that is
> pointing to those pages.
>
> device-dax may not have an issue with use-after-VMA-invalidation by
> its very nature since every PFN always points to the same
> thing. fsdax and this p2p stuff are different though.
>
>> Before the invalidation, an active flag is cleared to ensure no new
>> mappings can be created while the unmap is proceeding.
>> unmap_mapping_range() should sequence itself with the TLB flush and
> AFAIK unmap_mapping_range() kicks off the TLB flush and then
> returns. It doesn't always wait for the flush to fully finish. Ie some
> cases use RCU to lock the page table against GUP fast and so the
> put_page() doesn't happen until the call_rcu completes - after a grace
> period. The unmap_mapping_range() does not wait for grace periods.

Wow, wait a second. That is quite a bummer. At least in all GEM/TTM
based graphics drivers that could potentially cause a lot of trouble.

I've just double checked and we certainly have the assumption that when 
unmap_mapping_range() returns the pte is gone and the TLB flush 
completed in quite a number of places.

Do you have more information when and why that can happen?

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-04  6:58                       ` Christian König
@ 2021-10-04 13:11                         ` Jason Gunthorpe
  2021-10-04 13:22                           ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-04 13:11 UTC (permalink / raw)
  To: Christian König
  Cc: Logan Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams, linux-kernel, linux-nvme,
	linux-block, linux-pci, linux-mm, iommu, Stephen Bates,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter,
	Jakowski Andrzej, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni

On Mon, Oct 04, 2021 at 08:58:35AM +0200, Christian König wrote:
> I'm not following this discussion too closely, but try to look into it from
> time to time.
> 
> Am 01.10.21 um 19:45 schrieb Jason Gunthorpe:
> > On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:
> > 
> > > In device-dax, the refcount is only used to prevent the device, and
> > > therefore the pages, from going away on device unbind. Pages cannot be
> > > recycled, as you say, as they are mapped linearly within the device. The
> > > address space invalidation is done only when the device is unbound.
> > By address space invalidation I mean invalidation of the VMA that is
> > pointing to those pages.
> > 
> > device-dax may not have an issue with use-after-VMA-invalidation by
> > its very nature since every PFN always points to the same
> > thing. fsdax and this p2p stuff are different though.
> > 
> > > Before the invalidation, an active flag is cleared to ensure no new
> > > mappings can be created while the unmap is proceeding.
> > > unmap_mapping_range() should sequence itself with the TLB flush and
> > AFAIK unmap_mapping_range() kicks off the TLB flush and then
> > returns. It doesn't always wait for the flush to fully finish. Ie some
> > cases use RCU to lock the page table against GUP fast and so the
> > put_page() doesn't happen until the call_rcu completes - after a grace
> > period. The unmap_mapping_range() does not wait for grace periods.
> 
> Wow, wait a second. That is quite a bummer. At least in all GEM/TTM based
> graphics drivers that could potentially cause a lot of trouble.
> 
> I've just double checked and we certainly have the assumption that when
> unmap_mapping_range() returns the pte is gone and the TLB flush completed in
> quite a number of places.
> 
> Do you have more information when and why that can happen?

There are two things to keep in mind, flushing the PTEs from the HW
and serializing against gup_fast.

If you start at unmap_mapping_range() the page is eventually
discovered in zap_pte_range() and the PTE cleared. It is then passed
into __tlb_remove_page() which puts it on the batch->pages list

The page free happens in tlb_batch_pages_flush() via
free_pages_and_swap_cache()

The tlb_batch_pages_flush() happens via zap_page_range() ->
tlb_finish_mmu(), presumably after the HW has wiped the TLB's on all
CPUs. On x86 this is done with an IPI and also serializes gup fast, so
OK

The interesting case is CONFIG_MMU_GATHER_RCU_TABLE_FREE which doesn't
rely on IPIs anymore to synchronize with gup-fast.

In this configuration it means when unmap_mapping_range() returns the
TLB will have been flushed, but no serialization with GUP fast was
done.

This is OK if the GUP fast cannot return the page at all. I assume
this generally describes the DRM cases?

However, if the GUP fast can return the page then something,
somewhere, needs to serialize the page free with the RCU as the GUP
fast can be observing the old PTE before it was zap'd until the RCU
grace expires.

Relying on the page ref being !0 to protect GUP fast is not safe
because the page ref can be incr'd immediately upon page re-use.

Interestingly I looked around for this on PPC and I only found RCU
delayed freeing of the page table level, not RCU delayed freeing of
pages themselves.. I wonder if it was missed? 

There is a path on PPC (tlb_remove_table_sync_one) that triggers an
IPI but it looks like an exception, and we wouldn't need the RCU at
all if we used IPI to serialize GUP fast...

It makes logical sense if the RCU also frees the pages on
CONFIG_MMU_GATHER_RCU_TABLE_FREE so anything returnable by GUP fast
must be refcounted and freed by tlb_batch_pages_flush(), not by the
caller of unmap_mapping_range().

If we expect to allow the caller of unmap_mapping_range() to free then
CONFIG_MMU_GATHER_RCU_TABLE_FREE can't really exist, we always need to
trigger a serializing IPI during tlb_batch_pages_flush()

AFAICT, at least

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-04 13:11                         ` Jason Gunthorpe
@ 2021-10-04 13:22                           ` Christian König
  2021-10-04 13:27                             ` Jason Gunthorpe
  0 siblings, 1 reply; 87+ messages in thread
From: Christian König @ 2021-10-04 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams, linux-kernel, linux-nvme,
	linux-block, linux-pci, linux-mm, iommu, Stephen Bates,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter,
	Jakowski Andrzej, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni

Am 04.10.21 um 15:11 schrieb Jason Gunthorpe:
> On Mon, Oct 04, 2021 at 08:58:35AM +0200, Christian König wrote:
>> I'm not following this discussion too closely, but try to look into it from
>> time to time.
>>
>> Am 01.10.21 um 19:45 schrieb Jason Gunthorpe:
>>> On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:
>>>
>>>> In device-dax, the refcount is only used to prevent the device, and
>>>> therefore the pages, from going away on device unbind. Pages cannot be
>>>> recycled, as you say, as they are mapped linearly within the device. The
>>>> address space invalidation is done only when the device is unbound.
>>> By address space invalidation I mean invalidation of the VMA that is
>>> pointing to those pages.
>>>
>>> device-dax may not have an issue with use-after-VMA-invalidation by
>>> its very nature since every PFN always points to the same
>>> thing. fsdax and this p2p stuff are different though.
>>>
>>>> Before the invalidation, an active flag is cleared to ensure no new
>>>> mappings can be created while the unmap is proceeding.
>>>> unmap_mapping_range() should sequence itself with the TLB flush and
>>> AFAIK unmap_mapping_range() kicks off the TLB flush and then
>>> returns. It doesn't always wait for the flush to fully finish. Ie some
>>> cases use RCU to lock the page table against GUP fast and so the
>>> put_page() doesn't happen until the call_rcu completes - after a grace
>>> period. The unmap_mapping_range() does not wait for grace periods.
>> Wow, wait a second. That is quite a bummer. At least in all GEM/TTM based
>> graphics drivers that could potentially cause a lot of trouble.
>>
>> I've just double checked and we certainly have the assumption that when
>> unmap_mapping_range() returns the pte is gone and the TLB flush completed in
>> quite a number of places.
>>
>> Do you have more information when and why that can happen?
> There are two things to keep in mind, flushing the PTEs from the HW
> and serializing against gup_fast.
>
> If you start at unmap_mapping_range() the page is eventually
> discovered in zap_pte_range() and the PTE cleared. It is then passed
> into __tlb_remove_page() which puts it on the batch->pages list
>
> The page free happens in tlb_batch_pages_flush() via
> free_pages_and_swap_cache()
>
> The tlb_batch_pages_flush() happens via zap_page_range() ->
> tlb_finish_mmu(), presumably after the HW has wiped the TLB's on all
> CPUs. On x86 this is done with an IPI and also serializes gup fast, so
> OK
>
> The interesting case is CONFIG_MMU_GATHER_RCU_TABLE_FREE which doesn't
> rely on IPIs anymore to synchronize with gup-fast.
>
> In this configuration it means when unmap_mapping_range() returns the
> TLB will have been flushed, but no serialization with GUP fast was
> done.
>
> This is OK if the GUP fast cannot return the page at all. I assume
> this generally describes the DRM cases?

Yes, exactly that. GUP is completely forbidden for such mappings.

But what about accesses by other CPUs? In other words our use case is 
like the following:

1. We found that we need exclusive access to the higher level object a 
page belongs to.

2. The lock of the higher level object is taken. The lock is also taken 
in the fault handler for the VMA which inserts the PTE in the first place.

3. unmap_mapping_range() for the range of the object is called, the 
expectation is that when that function returns only the kernel can have 
a mapping of the pages backing the object.

4. The kernel has exclusive access to the pages and we know that 
userspace can't mess with them any more.

That use case is completely unrelated to GUP and when this doesn't work 
we have quite a problem.
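As a non-compilable kernel-style sketch of steps 1-4 above (all names here are hypothetical, not the actual GEM/TTM code):

```c
/* Hypothetical revoke path; illustrates the steps above. */
static int obj_take_exclusive(struct my_obj *obj)
{
	mutex_lock(&obj->lock);	/* 2: same lock the fault handler takes */
	obj->mappable = false;	/* fault handler refuses new PTEs from here on */

	/* 3: tear down every userspace PTE covering the object */
	unmap_mapping_range(obj->mapping, obj->offset, obj->size, 1);

	/*
	 * 4: the expectation: once this returns, no CPU still holds a
	 * stale translation to these pages -- only the kernel can
	 * touch them until the object is mapped again.
	 */
	return 0;
}
```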

I should probably note that we recently switched from VM_MIXEDMAP to 
using VM_PFNMAP because the former didn't prevent GUP on all 
architectures.
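For context, the VM_PFNMAP style of fault handler looks roughly like this (a sketch with hypothetical names, not our actual driver code); because vmf_insert_pfn() installs a raw PFN with no struct page behind the PTE, GUP cannot pin the mapping:

```c
/* Hypothetical VM_PFNMAP fault-handler sketch. */
static vm_fault_t obj_fault(struct vm_fault *vmf)
{
	struct my_obj *obj = vmf->vma->vm_private_data;
	vm_fault_t ret;

	mutex_lock(&obj->lock);		/* serializes against the revoke path */
	if (!obj->mappable) {
		mutex_unlock(&obj->lock);
		return VM_FAULT_SIGBUS;	/* object was revoked */
	}
	/* PFN-based insert: no struct page, so GUP cannot take a reference */
	ret = vmf_insert_pfn(vmf->vma, vmf->address,
			     obj->base_pfn + vmf->pgoff);
	mutex_unlock(&obj->lock);
	return ret;
}
```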

Christian.

> However, if the GUP fast can return the page then something,
> somewhere, needs to serialize the page free with the RCU as the GUP
> fast can be observing the old PTE before it was zap'd until the RCU
> grace expires.
>
> Relying on the page ref being !0 to protect GUP fast is not safe
> because the page ref can be incr'd immediately upon page re-use.
>
> Interestingly I looked around for this on PPC and I only found RCU
> delayed freeing of the page table level, not RCU delayed freeing of
> pages themselves.. I wonder if it was missed?
>
> There is a path on PPC (tlb_remove_table_sync_one) that triggers an
> IPI but it looks like an exception, and we wouldn't need the RCU at
> all if we used IPI to serialize GUP fast...
>
> It makes logical sense if the RCU also frees the pages on
> CONFIG_MMU_GATHER_RCU_TABLE_FREE so anything returnable by GUP fast
> must be refcounted and freed by tlb_batch_pages_flush(), not by the
> caller of unmap_mapping_range().
>
> If we expect to allow the caller of unmap_mapping_range() to free then
> CONFIG_MMU_GATHER_RCU_TABLE_FREE can't really exist, we always need to
> trigger a serializing IPI during tlb_batch_pages_flush()
>
> AFAICT, at least
>
> Jason


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-04 13:22                           ` Christian König
@ 2021-10-04 13:27                             ` Jason Gunthorpe
  2021-10-04 14:54                               ` Christian König
  0 siblings, 1 reply; 87+ messages in thread
From: Jason Gunthorpe @ 2021-10-04 13:27 UTC (permalink / raw)
  To: Christian König
  Cc: Logan Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams, linux-kernel, linux-nvme,
	linux-block, linux-pci, linux-mm, iommu, Stephen Bates,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter,
	Jakowski Andrzej, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni

On Mon, Oct 04, 2021 at 03:22:22PM +0200, Christian König wrote:

> That use case is completely unrelated to GUP and when this doesn't work we
> have quite a problem.

My read is that unmap_mapping_range() guarantees the physical TLB
hardware is serialized across all CPUs upon return.

It also guarantees GUP slow is serialized due to the page table
spinlocks.

Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
  2021-10-04 13:27                             ` Jason Gunthorpe
@ 2021-10-04 14:54                               ` Christian König
  0 siblings, 0 replies; 87+ messages in thread
From: Christian König @ 2021-10-04 14:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Logan Gunthorpe, Alistair Popple, Felix Kuehling,
	Christoph Hellwig, Dan Williams, linux-kernel, linux-nvme,
	linux-block, linux-pci, linux-mm, iommu, Stephen Bates,
	John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter,
	Jakowski Andrzej, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni

On 04.10.21 at 15:27, Jason Gunthorpe wrote:
> On Mon, Oct 04, 2021 at 03:22:22PM +0200, Christian König wrote:
>
>> That use case is completely unrelated to GUP and when this doesn't work we
>> have quite a problem.
> My read is that unmap_mapping_range() guarantees the physical TLB
> hardware is serialized across all CPUs upon return.

Thanks, that's what I wanted to make sure.

Christian.

>
> It also guarantees GUP slow is serialized due to the page table
> spinlocks.
>
> Jason


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable()
  2021-09-16 23:40 ` [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable() Logan Gunthorpe
@ 2021-10-05 22:29   ` Max Gurtovoy
  0 siblings, 0 replies; 87+ messages in thread
From: Max Gurtovoy @ 2021-10-05 22:29 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

Logan,

On 9/17/2021 2:40 AM, Logan Gunthorpe wrote:
> The dma_map operations now support P2PDMA pages directly. So remove
> the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
> to dma_map_sgtable().
>
> dma_map_sgtable() returns more complete error codes than dma_map_sg()
> and allows differentiating EREMOTEIO errors in case an unsupported
> P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
> so the request isn't retried.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/nvme/host/pci.c | 69 +++++++++++++++++------------------------
>   1 file changed, 29 insertions(+), 40 deletions(-)
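For readers following the series, the error-path shape described above amounts to roughly the following (a sketch with hypothetical field names, not the literal patch hunk):

```c
/* Sketch: map the request's sgtable and translate the error code. */
rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
		     DMA_ATTR_NO_WARN);
if (rc) {
	if (rc == -EREMOTEIO)
		/* unsupported P2PDMA transfer: fail without retry */
		ret = BLK_STS_TARGET;
	else
		ret = BLK_STS_RESOURCE;
	goto out_free_sg;
}
```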

Looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  2021-09-16 23:40 ` [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported() Logan Gunthorpe
  2021-09-28 19:17   ` Jason Gunthorpe
@ 2021-10-05 22:31   ` Max Gurtovoy
  1 sibling, 0 replies; 87+ messages in thread
From: Max Gurtovoy @ 2021-10-05 22:31 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni

Logan,

On 9/17/2021 2:40 AM, Logan Gunthorpe wrote:
> Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
> if a given ib_device can be used in P2PDMA transfers. This ensures
> the ib_device is not using virt_dma and also that the underlying
> dma_device supports P2PDMA.
>
> Use the new helper in nvme-rdma to replace the existing check for
> ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
> switching away from pci_p2pdma_[un]map_sg().
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/nvme/target/rdma.c |  2 +-
>   include/rdma/ib_verbs.h    | 11 +++++++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
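From the description, the helper's logic should be essentially the following (a sketch, not the actual patch):

```c
/* Sketch: usable for P2PDMA only if not virt_dma and the
 * underlying dma_device supports P2PDMA. */
static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
{
	if (ib_uses_virt_dma(dev))
		return false;
	return dma_pci_p2pdma_supported(dev->dma_device);
}
```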

Looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable()
  2021-09-28 19:43   ` Jason Gunthorpe
  2021-09-29 22:56     ` Logan Gunthorpe
@ 2021-10-05 22:40     ` Max Gurtovoy
  1 sibling, 0 replies; 87+ messages in thread
From: Max Gurtovoy @ 2021-10-05 22:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	iommu, Stephen Bates, Christoph Hellwig, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni


On 9/28/2021 10:43 PM, Jason Gunthorpe wrote:
> On Thu, Sep 16, 2021 at 05:40:52PM -0600, Logan Gunthorpe wrote:
>> dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
>> is no longer necessary and may be dropped.
>>
>> Switch to the dma_map_sgtable() interface which will allow for better
>> error reporting if the P2PDMA pages are unsupported.
>>
>> The change to sgtable also appears to fix a couple subtle error path
>> bugs:
>>
>>    - In rdma_rw_ctx_init(), dma_unmap would be called with an sg
>>      that could have been incremented from the original call, as
>>      well as an nents that was not the original number of nents
>>      called when mapped.
>>    - Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg
>>      were unmapped with the incorrect number of nents.
> Those bugs should definitely get fixed. I might extract the sgtable
> conversion into a stand alone patch to do it.

Yes, we need these fixes before this series will converge.

Looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>

>
> But as it is, this looks fine
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> Jason

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg()
  2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
  2021-09-27 18:50   ` Bjorn Helgaas
  2021-09-28 19:43   ` Jason Gunthorpe
@ 2021-10-05 22:42   ` Max Gurtovoy
  2 siblings, 0 replies; 87+ messages in thread
From: Max Gurtovoy @ 2021-10-05 22:42 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
	linux-pci, linux-mm, iommu
  Cc: Stephen Bates, Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Jakowski Andrzej, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni


On 9/17/2021 2:40 AM, Logan Gunthorpe wrote:
> This interface is superseded by support in dma_map_sg() which now supports
> heterogeneous scatterlists. There are no longer any users, so remove it.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/pci/p2pdma.c       | 65 --------------------------------------
>   include/linux/pci-p2pdma.h | 27 ----------------
>   2 files changed, 92 deletions(-)

Looks good,

Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>



^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2021-10-05 22:42 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-16 23:40 [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
2021-09-16 23:40 ` [PATCH v3 01/20] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL Logan Gunthorpe
2021-09-28 18:32   ` Jason Gunthorpe
2021-09-29 21:15     ` Logan Gunthorpe
2021-09-30  4:47   ` Chaitanya Kulkarni
2021-09-30 16:49     ` Logan Gunthorpe
2021-09-30  4:57   ` Chaitanya Kulkarni
2021-09-16 23:40 ` [PATCH v3 02/20] PCI/P2PDMA: attempt to set map_type if it has not been set Logan Gunthorpe
2021-09-27 18:50   ` Bjorn Helgaas
2021-09-16 23:40 ` [PATCH v3 03/20] PCI/P2PDMA: make pci_p2pdma_map_type() non-static Logan Gunthorpe
2021-09-27 18:46   ` Bjorn Helgaas
2021-09-28 18:48   ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 04/20] PCI/P2PDMA: introduce helpers for dma_map_sg implementations Logan Gunthorpe
2021-09-27 18:53   ` Bjorn Helgaas
2021-09-27 19:59     ` Logan Gunthorpe
2021-09-28 18:55   ` Jason Gunthorpe
2021-09-29 21:26     ` Logan Gunthorpe
2021-09-28 22:05   ` [PATCH v3 4/20] " Jason Gunthorpe
2021-09-29 21:30     ` Logan Gunthorpe
2021-09-29 22:46       ` Jason Gunthorpe
2021-09-29 23:00         ` Logan Gunthorpe
2021-09-29 23:40           ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 05/20] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers Logan Gunthorpe
2021-09-28 18:57   ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 06/20] dma-direct: support PCI P2PDMA pages in dma-direct map_sg Logan Gunthorpe
2021-09-28 19:08   ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 07/20] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support Logan Gunthorpe
2021-09-28 19:11   ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 08/20] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg Logan Gunthorpe
2021-09-28 19:15   ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 09/20] nvme-pci: check DMA ops when indicating support for PCI P2PDMA Logan Gunthorpe
2021-09-30  5:06   ` Chaitanya Kulkarni
2021-09-30 16:51     ` Logan Gunthorpe
2021-09-30 17:19       ` Chaitanya Kulkarni
2021-09-16 23:40 ` [PATCH v3 10/20] nvme-pci: convert to using dma_map_sgtable() Logan Gunthorpe
2021-10-05 22:29   ` Max Gurtovoy
2021-09-16 23:40 ` [PATCH v3 11/20] RDMA/core: introduce ib_dma_pci_p2p_dma_supported() Logan Gunthorpe
2021-09-28 19:17   ` Jason Gunthorpe
2021-10-05 22:31   ` Max Gurtovoy
2021-09-16 23:40 ` [PATCH v3 12/20] RDMA/rw: use dma_map_sgtable() Logan Gunthorpe
2021-09-28 19:43   ` Jason Gunthorpe
2021-09-29 22:56     ` Logan Gunthorpe
2021-10-05 22:40     ` Max Gurtovoy
2021-09-16 23:40 ` [PATCH v3 13/20] PCI/P2PDMA: remove pci_p2pdma_[un]map_sg() Logan Gunthorpe
2021-09-27 18:50   ` Bjorn Helgaas
2021-09-28 19:43   ` Jason Gunthorpe
2021-10-05 22:42   ` Max Gurtovoy
2021-09-16 23:40 ` [PATCH v3 14/20] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
2021-09-28 19:47   ` Jason Gunthorpe
2021-09-29 21:34     ` Logan Gunthorpe
2021-09-29 22:48       ` Jason Gunthorpe
2021-09-16 23:40 ` [PATCH v3 15/20] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
2021-09-16 23:40 ` [PATCH v3 16/20] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
2021-09-16 23:40 ` [PATCH v3 17/20] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
2021-09-16 23:40 ` [PATCH v3 18/20] mm: use custom page_free for P2PDMA pages Logan Gunthorpe
2021-09-16 23:40 ` [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem() Logan Gunthorpe
2021-09-27 18:49   ` Bjorn Helgaas
2021-09-28 19:55   ` Jason Gunthorpe
2021-09-29 21:42     ` Logan Gunthorpe
2021-09-29 23:05       ` Jason Gunthorpe
2021-09-29 23:27         ` Logan Gunthorpe
2021-09-29 23:35           ` Jason Gunthorpe
2021-09-29 23:49             ` Logan Gunthorpe
2021-09-30  0:36               ` Jason Gunthorpe
2021-10-01 13:48                 ` Jason Gunthorpe
2021-10-01 17:01                   ` Logan Gunthorpe
2021-10-01 17:45                     ` Jason Gunthorpe
2021-10-01 20:13                       ` Logan Gunthorpe
2021-10-01 22:14                         ` Jason Gunthorpe
2021-10-01 22:22                           ` Logan Gunthorpe
2021-10-01 22:46                             ` Jason Gunthorpe
2021-10-01 23:27                               ` John Hubbard
2021-10-01 23:34                               ` Logan Gunthorpe
2021-10-04  6:58                       ` Christian König
2021-10-04 13:11                         ` Jason Gunthorpe
2021-10-04 13:22                           ` Christian König
2021-10-04 13:27                             ` Jason Gunthorpe
2021-10-04 14:54                               ` Christian König
2021-09-28 20:05   ` Jason Gunthorpe
2021-09-29 21:46     ` Logan Gunthorpe
2021-09-16 23:41 ` [PATCH v3 20/20] nvme-pci: allow mmaping the CMB in userspace Logan Gunthorpe
2021-09-28 20:02 ` [PATCH v3 00/20] Userspace P2PDMA with O_DIRECT NVMe devices Jason Gunthorpe
2021-09-29 21:50   ` Logan Gunthorpe
2021-09-29 23:21     ` Jason Gunthorpe
2021-09-29 23:28       ` Logan Gunthorpe
2021-09-29 23:36         ` Jason Gunthorpe
2021-09-29 23:52           ` Logan Gunthorpe
