* [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices
@ 2022-09-22 16:39 Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Hi,

This is the latest P2PDMA userspace patch set. This version includes
some cleanup from feedback of the last posting[1].

This patch set enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 1, then Patches 2 through 6 wire this flag up based
on whether the block queue indicates P2PDMA support. Patch 7
creates the sysfs resource that can hand out the VMAs and Patch 8
adds brief documentation for the new interface.
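
For reference, a minimal userspace sketch of the intended flow looks like
this (the PCI device path, file path and transfer size are illustrative
only):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;	/* multiple of the page size */
	void *p2p_buf;
	int alloc_fd, data_fd;

	/* Allocate a chunk of the CMB by mmap()ing the sysfs attribute */
	alloc_fd = open("/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate",
			O_RDWR);
	if (alloc_fd < 0)
		return 1;

	/* Only shared mappings with a zero offset are accepted */
	p2p_buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       alloc_fd, 0);
	if (p2p_buf == MAP_FAILED)
		return 1;

	/* Use the CMB buffer as the target of O_DIRECT IO on an
	 * NVMe-backed file */
	data_fd = open("/mnt/nvme/data", O_RDONLY | O_DIRECT);
	if (data_fd < 0)
		return 1;

	if (pread(data_fd, p2p_buf, len, 0) < 0)
		perror("pread");

	munmap(p2p_buf, len);	/* pages return to the p2pmem pool */
	close(data_fd);
	close(alloc_fd);
	return 0;
}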

Feedback welcome.

This series is based on v6.0-rc6. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v10

Thanks,

Logan

[1] https://lkml.kernel.org/r/20220825152425.6296-1-logang@deltatee.com

--

Changes since v9:
  - Rebased onto v6.0-rc6
  - Reworked iov iter changes to reuse the code better and
    name them without the _flags() suffix (per Christoph)
  - Renamed a number of flags variables to gup_flags (per John)
  - Minor fixups to the last documentation patch (from Greg and John)

Changes since v7:
  - Rebased onto v6.0-rc2, included reworking the iov_iter patch
    due to changes there
  - Drop the char device mmap implementation in favour of a sysfs
    based interface. (per Christoph)

Changes since v6:
  - Rebase onto v5.19-rc1
  - Rework how the pages are stored in the VMA per Jason's suggestion

Changes since v5:
  - Rebased onto v5.18-rc1 which includes Christoph's cleanup to
    free_zone_device_page() (similar to Ralph's patch).
  - Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
    that caused a double allocation and lost p2p memory. Noticed
    by Andrew Maier.
  - Collected a Reviewed-by tag from Chaitanya.
  - Numerous minor fixes to commit messages

--

Logan Gunthorpe (8):
  mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  block: add check when merging zone device pages
  lib/scatterlist: add check when merging zone device pages
  block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  PCI/P2PDMA: Allow userspace VMA allocations through sysfs
  ABI: sysfs-bus-pci: add documentation for p2pmem allocate

 Documentation/ABI/testing/sysfs-bus-pci |  10 ++
 block/bio.c                             |  11 ++-
 block/blk-map.c                         |   7 +-
 drivers/pci/p2pdma.c                    | 124 ++++++++++++++++++++++++
 include/linux/mm.h                      |   1 +
 include/linux/mmzone.h                  |  24 +++++
 include/linux/uio.h                     |   6 ++
 lib/iov_iter.c                          |  32 ++++--
 lib/scatterlist.c                       |  25 +++--
 mm/gup.c                                |  22 ++++-
 10 files changed, 240 insertions(+), 22 deletions(-)


base-commit: 521a547ced6477c54b4b0cc206000406c221b4d6
--
2.30.2


* [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-23 18:13   ` Jason Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
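
For illustration, a GUP caller that is prepared to handle P2PDMA pages
would opt in roughly as follows (a sketch only; the address and page
count are placeholders):

	struct page *pages[16];
	int ret;

	ret = pin_user_pages_fast(addr, 16, FOLL_WRITE | FOLL_PCI_P2PDMA,
				  pages);
	if (ret < 0)
		return ret;	/* without the flag, hitting a P2PDMA page fails */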

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..3cea77c8a9ea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2897,6 +2897,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..108848b67f6f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -601,6 +601,12 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+		     is_pci_p2pdma_page(page))) {
+		page = ERR_PTR(-EREMOTEIO);
+		goto out;
+	}
+
 	VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
 		       !PageAnonExclusive(page), page);
 
@@ -1039,6 +1045,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+		return -EOPNOTSUPP;
+
 	if (vma_is_secretmem(vma))
 		return -EFAULT;
 
@@ -2383,6 +2392,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+			     is_pci_p2pdma_page(page)))
+			goto pte_unmap;
+
 		folio = try_grab_folio(page, 1, flags);
 		if (!folio)
 			goto pte_unmap;
@@ -2462,6 +2475,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			break;
 		}
+
+		if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			break;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
@@ -2950,7 +2969,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
-				       FOLL_FAST_ONLY | FOLL_NOFAULT)))
+				       FOLL_FAST_ONLY | FOLL_NOFAULT |
+				       FOLL_PCI_P2PDMA)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
-- 
2.30.2



* [PATCH v10 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 3/8] block: add check when merging zone device pages Logan Gunthorpe
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
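
For example, a caller that accepts P2PDMA pages can then pass the flag
straight through (illustrative sketch only):

	struct page **pages;
	size_t offset;
	ssize_t n;

	n = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offset,
				     FOLL_PCI_P2PDMA);
	if (n <= 0)
		return n ? n : -EFAULT;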

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 include/linux/uio.h |  6 ++++++
 lib/iov_iter.c      | 32 ++++++++++++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5896af36199c..5d976d01ccb9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *start,
+		unsigned gup_flags);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start,
+		unsigned gup_flags);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4b7fce72e3e5..8f089d661a41 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1427,7 +1427,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   unsigned int maxpages, size_t *start)
+		   unsigned int maxpages, size_t *start,
+		   unsigned int gup_flags)
 {
 	unsigned int n;
 
@@ -1439,7 +1440,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		maxsize = MAX_RW_COUNT;
 
 	if (likely(user_backed_iter(i))) {
-		unsigned int gup_flags = 0;
 		unsigned long addr;
 		int res;
 
@@ -1489,33 +1489,49 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	return -EFAULT;
 }
 
-ssize_t iov_iter_get_pages2(struct iov_iter *i,
+ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
+		   size_t *start, unsigned gup_flags)
 {
 	if (!maxpages)
 		return 0;
 	BUG_ON(!pages);
 
-	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
+					  start, gup_flags);
+}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages);
+
+ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
+		size_t maxsize, unsigned maxpages, size_t *start)
+{
+	return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
 }
 EXPORT_SYMBOL(iov_iter_get_pages2);
 
-ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start)
+		   size_t *start, unsigned gup_flags)
 {
 	ssize_t len;
 
 	*pages = NULL;
 
-	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+					 gup_flags);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;
 	}
 	return len;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
+
+ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
+		struct page ***pages, size_t maxsize, size_t *start)
+{
+	return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2



* [PATCH v10 3/8] block: add check when merging zone device pages
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 4/8] lib/scatterlist: " Logan Gunthorpe
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment.

Add a helper, zone_device_pages_have_same_pgmap(), to determine if zone
device pages are mergeable and use it in page_is_mergeable(). The helper
returns true if both pages are not zone device pages, or if both are
zone device pages with the same pgmap.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c            |  2 ++
 include/linux/mmzone.h | 24 ++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea2..969607bc1f4d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -865,6 +865,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 		return false;
 	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
 		return false;
+	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+		return false;
 
 	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
 	if (*same_page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e24b40c52468..2c31915b057e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -794,6 +794,25 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return page_zonenum(page) == ZONE_DEVICE;
 }
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+						     const struct page *b)
+{
+	if (is_zone_device_page(a) != is_zone_device_page(b))
+		return false;
+	if (!is_zone_device_page(a))
+		return true;
+	return a->pgmap == b->pgmap;
+}
+
 extern void memmap_init_zone_device(struct zone *, unsigned long,
 				    unsigned long, struct dev_pagemap *);
 #else
@@ -801,6 +820,11 @@ static inline bool is_zone_device_page(const struct page *page)
 {
 	return false;
 }
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+						     const struct page *b)
+{
+	return true;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
-- 
2.30.2



* [PATCH v10 4/8] lib/scatterlist: add check when merging zone device pages
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 3/8] block: add check when merging zone device pages Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The zone_device_pages_have_same_pgmap()
helper returns true if both pages are not zone device pages, or if both
are zone device pages with the same pgmap.

Factor out the check for page mergeability into a pages_are_mergeable()
helper and add a check with zone_device_pages_have_same_pgmap().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 lib/scatterlist.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c8c3d675845c..a0ad2a7959b5 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct sg_append_table *table,
 	return new_sg;
 }
 
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+	if (page_to_pfn(a) != page_to_pfn(b) + 1)
+		return false;
+	if (!zone_device_pages_have_same_pgmap(a, b))
+		return false;
+	return true;
+}
+
 /**
  * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
  *                                    table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
 	unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
 	unsigned int added_nents = 0;
 	struct scatterlist *s = sgt_append->prv;
+	struct page *last_pg;
 
 	/*
 	 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
 		return -EOPNOTSUPP;
 
 	if (sgt_append->prv) {
-		unsigned long paddr =
-			(page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
-			 sgt_append->prv->offset + sgt_append->prv->length) /
-			PAGE_SIZE;
-
 		if (WARN_ON(offset))
 			return -EINVAL;
 
 		/* Merge contiguous pages into the last SG */
 		prv_len = sgt_append->prv->length;
-		while (n_pages && page_to_pfn(pages[0]) == paddr) {
+		last_pg = sg_page(sgt_append->prv);
+		while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
 			if (sgt_append->prv->length + PAGE_SIZE > max_segment)
 				break;
 			sgt_append->prv->length += PAGE_SIZE;
-			paddr++;
+			last_pg = pages[0];
 			pages++;
 			n_pages--;
 		}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
 	for (i = 1; i < n_pages; i++) {
 		seg_len += PAGE_SIZE;
 		if (seg_len >= max_segment ||
-		    page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+		    !pages_are_mergeable(pages[i], pages[i - 1])) {
 			chunks++;
 			seg_len = 0;
 		}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
 		for (j = cur_page + 1; j < n_pages; j++) {
 			seg_len += PAGE_SIZE;
 			if (seg_len >= max_segment ||
-			    page_to_pfn(pages[j]) !=
-			    page_to_pfn(pages[j - 1]) + 1)
+			    !pages_are_mergeable(pages[j], pages[j - 1]))
 				break;
 		}
 
-- 
2.30.2



* [PATCH v10 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 4/8] lib/scatterlist: " Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages(). This allows PCI P2PDMA pages to be passed from
userspace and enables the O_DIRECT path in iomap-based filesystems and
directly on block devices.
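
For context, the capability checked here is advertised by the driver that
owns the queue, roughly like this (sketch; supports_p2pdma() is a
hypothetical helper):

	if (supports_p2pdma(dev))
		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, q);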

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 969607bc1f4d..b5f7e9b493fc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1200,6 +1200,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
+	unsigned int gup_flags = 0;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1213,6 +1214,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	/*
 	 * Each segment in the iov is required to be a block size multiple.
 	 * However, we may not be able to get the entire segment if it spans
@@ -1220,8 +1224,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset);
+	size = iov_iter_get_pages(iter, pages,
+				  UINT_MAX - bio->bi_iter.bi_size,
+				  nr_pages, &offset, gup_flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
-- 
2.30.2



* [PATCH v10 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 16:39 ` [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_alloc(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to
use P2PDMA pages.
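
As an example of the user-visible effect, an NVMe passthru read can now
target a buffer that was previously mmap()ed from the p2pmem "allocate"
attribute (illustrative sketch; p2p_buf, nvme_ns_fd and the LBA/block
values are placeholders):

	#include <stdint.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	/* p2p_buf: mmap()ed from p2pmem/allocate; nvme_ns_fd: open namespace */
	struct nvme_user_io io = {
		.opcode		= 0x02,			/* NVMe read */
		.addr		= (__u64)(uintptr_t)p2p_buf,
		.slba		= 0,
		.nblocks	= 7,			/* 0's based: 8 blocks */
	};

	if (ioctl(nvme_ns_fd, NVME_IOCTL_SUBMIT_IO, &io) < 0)
		perror("NVME_IOCTL_SUBMIT_IO");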

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 block/blk-map.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 7196a6b64c80..7882504b99ac 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -236,6 +236,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+	unsigned int gup_flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -248,13 +249,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 	bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
 
+	if (blk_queue_pci_p2pdma(rq->q))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	while (iov_iter_count(iter)) {
 		struct page **pages;
 		ssize_t bytes;
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX,
+						 &offs, gup_flags);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
-- 
2.30.2



* [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-22 18:27   ` Bjorn Helgaas
  2022-09-23  8:15   ` Greg Kroah-Hartman
  2022-09-22 16:39 ` [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Create a sysfs bin attribute called "allocate" under the existing
"p2pmem" group. The only allowable operation on this file is the mmap()
call.

When mmap() is called on this attribute, the kernel allocates a chunk of
memory from the genalloc and inserts the pages into the VMA. The
dev_pagemap .page_free callback will indicate when these pages are no
longer used and they will be put back into the genalloc.

On device unbind, remove the sysfs file before the memremap_pages are
cleaned up. This ensures unmap_mapping_range() is called on the file's
inode and no new mappings can be created.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4496a7c5c478..a6ed6bbca214 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR_RO(published);
 
+static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
+		struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+	size_t len = vma->vm_end - vma->vm_start;
+	struct pci_p2pdma *p2pdma;
+	struct percpu_ref *ref;
+	unsigned long vaddr;
+	void *kaddr;
+	int ret;
+
+	/* prevent private mappings from being established */
+	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted private mapping\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	if (vma->vm_pgoff) {
+		pci_info_ratelimited(pdev,
+				     "%s: fail, attempted mapping with non-zero offset\n",
+				     current->comm);
+		return -EINVAL;
+	}
+
+	rcu_read_lock();
+	p2pdma = rcu_dereference(pdev->p2pdma);
+	if (!p2pdma) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
+	if (!kaddr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * vm_insert_page() can sleep, so a reference is taken to mapping
+	 * such that rcu_read_unlock() can be done before inserting the
+	 * pages
+	 */
+	if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
+		ret = -ENODEV;
+		goto out_free_mem;
+	}
+	rcu_read_unlock();
+
+	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+		ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+		if (ret) {
+			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+			return ret;
+		}
+		percpu_ref_get(ref);
+		put_page(virt_to_page(kaddr));
+		kaddr += PAGE_SIZE;
+		len -= PAGE_SIZE;
+	}
+
+	percpu_ref_put(ref);
+
+	return 0;
+out_free_mem:
+	gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
+static struct bin_attribute p2pmem_alloc_attr = {
+	.attr = { .name = "allocate", .mode = 0660 },
+	.mmap = p2pmem_alloc_mmap,
+	/*
+	 * Some places where we want to call mmap (ie. python) will check
+	 * that the file size is greater than the mmap size before allowing
+	 * the mmap to continue. To work around this, just set the size
+	 * to be very large.
+	 */
+	.size = SZ_1T,
+};
+
 static struct attribute *p2pmem_attrs[] = {
 	&dev_attr_size.attr,
 	&dev_attr_available.attr,
@@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
 	NULL,
 };
 
+static struct bin_attribute *p2pmem_bin_attrs[] = {
+	&p2pmem_alloc_attr,
+	NULL,
+};
+
 static const struct attribute_group p2pmem_group = {
 	.attrs = p2pmem_attrs,
+	.bin_attrs = p2pmem_bin_attrs,
 	.name = "p2pmem",
 };
 
+static void p2pdma_page_free(struct page *page)
+{
+	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+	struct percpu_ref *ref;
+
+	gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+			    (uintptr_t)page_to_virt(page), PAGE_SIZE,
+			    (void **)&ref);
+	percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+	.page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
 	struct pci_dev *pdev = data;
@@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 	return error;
 }
 
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+	struct pci_dev *pdev = data;
+
+	/*
+	 * Removing the alloc attribute from sysfs will call
+	 * unmap_mapping_range() on the inode, teardown any existing userspace
+	 * mappings and prevent new ones from being created.
+	 */
+	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+				     p2pmem_group.name);
+}
+
 /**
  * pci_p2pdma_add_resource - add memory for use as p2p memory
  * @pdev: the device to add the memory to
@@ -198,6 +316,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->range.end = pgmap->range.start + size - 1;
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+	pgmap->ops = &p2pdma_pgmap_ops;
 
 	p2p_pgmap->provider = pdev;
 	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
@@ -209,6 +328,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		goto pgmap_free;
 	}
 
+	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+					 pdev);
+	if (error)
+		goto pages_free;
+
 	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
 	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
-- 
2.30.2



* [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (6 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
@ 2022-09-22 16:39 ` Logan Gunthorpe
  2022-09-23  8:15   ` Greg Kroah-Hartman
  2022-09-23  6:01 ` [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
  2022-09-23  8:16 ` Greg Kroah-Hartman
  9 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-22 16:39 UTC (permalink / raw)
  To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
  Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates, Logan Gunthorpe

Add documentation for the p2pmem/allocate binary file which allows
for allocating p2pmem buffers in userspace for passing to drivers
that support them. (Currently only O_DIRECT to NVMe devices.)

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/ABI/testing/sysfs-bus-pci | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 6fc2c2efe8ab..f4602b8d6e11 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -407,6 +407,16 @@ Description:
 	        file contains a '1' if the memory has been published for
 		use outside the driver that owns the device.
 
+What:		/sys/bus/pci/devices/.../p2pmem/allocate
+Date:		August 2022
+Contact:	Logan Gunthorpe <logang@deltatee.com>
+Description:
+		This file allows mapping p2pmem into userspace. For each
+		mmap() call on this file, the kernel will allocate a chunk
+		of Peer-to-Peer memory for use in Peer-to-Peer transactions.
+		This memory can be used in O_DIRECT calls to NVMe backed
+		files for Peer-to-Peer copies.
+
 What:		/sys/bus/pci/devices/.../link/clkpm
 		/sys/bus/pci/devices/.../link/l0s_aspm
 		/sys/bus/pci/devices/.../link/l1_aspm
-- 
2.30.2



* Re: [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
  2022-09-22 16:39 ` [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
@ 2022-09-22 18:27   ` Bjorn Helgaas
  2022-09-23  8:15   ` Greg Kroah-Hartman
  1 sibling, 0 replies; 28+ messages in thread
From: Bjorn Helgaas @ 2022-09-22 18:27 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Thu, Sep 22, 2022 at 10:39:25AM -0600, Logan Gunthorpe wrote:
> Create a sysfs bin attribute called "allocate" under the existing
> "p2pmem" group. The only allowable operation on this file is the mmap()
> call.
> 
> When mmap() is called on this attribute, the kernel allocates a chunk of
> memory from the genalloc and inserts the pages into the VMA. The
> dev_pagemap .page_free callback will indicate when these pages are no
> longer used and they will be put back into the genalloc.
> 
> On device unbind, remove the sysfs file before the memremap_pages are
> cleaned up. This ensures unmap_mapping_range() is called on the file's
> inode and no new mappings can be created.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>

Not sure which tree this should go through, so:

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 124 insertions(+)
> 
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4496a7c5c478..a6ed6bbca214 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
>  }
>  static DEVICE_ATTR_RO(published);
>  
> +static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> +		struct bin_attribute *attr, struct vm_area_struct *vma)
> +{
> +	struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
> +	size_t len = vma->vm_end - vma->vm_start;
> +	struct pci_p2pdma *p2pdma;
> +	struct percpu_ref *ref;
> +	unsigned long vaddr;
> +	void *kaddr;
> +	int ret;
> +
> +	/* prevent private mappings from being established */
> +	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: fail, attempted private mapping\n",
> +				     current->comm);
> +		return -EINVAL;
> +	}
> +
> +	if (vma->vm_pgoff) {
> +		pci_info_ratelimited(pdev,
> +				     "%s: fail, attempted mapping with non-zero offset\n",
> +				     current->comm);
> +		return -EINVAL;
> +	}
> +
> +	rcu_read_lock();
> +	p2pdma = rcu_dereference(pdev->p2pdma);
> +	if (!p2pdma) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
> +	if (!kaddr) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/*
> +	 * vm_insert_page() can sleep, so a reference is taken to mapping
> +	 * such that rcu_read_unlock() can be done before inserting the
> +	 * pages
> +	 */
> +	if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
> +		ret = -ENODEV;
> +		goto out_free_mem;
> +	}
> +	rcu_read_unlock();
> +
> +	for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> +		ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
> +		if (ret) {
> +			gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> +			return ret;
> +		}
> +		percpu_ref_get(ref);
> +		put_page(virt_to_page(kaddr));
> +		kaddr += PAGE_SIZE;
> +		len -= PAGE_SIZE;
> +	}
> +
> +	percpu_ref_put(ref);
> +
> +	return 0;
> +out_free_mem:
> +	gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> +out:
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +static struct bin_attribute p2pmem_alloc_attr = {
> +	.attr = { .name = "allocate", .mode = 0660 },
> +	.mmap = p2pmem_alloc_mmap,
> +	/*
> +	 * Some places where we want to call mmap (ie. python) will check
> +	 * that the file size is greater than the mmap size before allowing
> +	 * the mmap to continue. To work around this, just set the size
> +	 * to be very large.
> +	 */
> +	.size = SZ_1T,
> +};
> +
>  static struct attribute *p2pmem_attrs[] = {
>  	&dev_attr_size.attr,
>  	&dev_attr_available.attr,
> @@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
>  	NULL,
>  };
>  
> +static struct bin_attribute *p2pmem_bin_attrs[] = {
> +	&p2pmem_alloc_attr,
> +	NULL,
> +};
> +
>  static const struct attribute_group p2pmem_group = {
>  	.attrs = p2pmem_attrs,
> +	.bin_attrs = p2pmem_bin_attrs,
>  	.name = "p2pmem",
>  };
>  
> +static void p2pdma_page_free(struct page *page)
> +{
> +	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> +	struct percpu_ref *ref;
> +
> +	gen_pool_free_owner(pgmap->provider->p2pdma->pool,
> +			    (uintptr_t)page_to_virt(page), PAGE_SIZE,
> +			    (void **)&ref);
> +	percpu_ref_put(ref);
> +}
> +
> +static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> +	.page_free = p2pdma_page_free,
> +};
> +
>  static void pci_p2pdma_release(void *data)
>  {
>  	struct pci_dev *pdev = data;
> @@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>  	return error;
>  }
>  
> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> +	struct pci_dev *pdev = data;
> +
> +	/*
> +	 * Removing the alloc attribute from sysfs will call
> +	 * unmap_mapping_range() on the inode, teardown any existing userspace
> +	 * mappings and prevent new ones from being created.
> +	 */
> +	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
> +				     p2pmem_group.name);
> +}
> +
>  /**
>   * pci_p2pdma_add_resource - add memory for use as p2p memory
>   * @pdev: the device to add the memory to
> @@ -198,6 +316,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  	pgmap->range.end = pgmap->range.start + size - 1;
>  	pgmap->nr_range = 1;
>  	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> +	pgmap->ops = &p2pdma_pgmap_ops;
>  
>  	p2p_pgmap->provider = pdev;
>  	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
> @@ -209,6 +328,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
>  		goto pgmap_free;
>  	}
>  
> +	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
> +					 pdev);
> +	if (error)
> +		goto pages_free;
> +
>  	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>  	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
>  			pci_bus_address(pdev, bar) + offset,
> -- 
> 2.30.2
> 


* Re: [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (7 preceding siblings ...)
  2022-09-22 16:39 ` [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
@ 2022-09-23  6:01 ` Christoph Hellwig
  2022-09-23 15:25   ` Logan Gunthorpe
  2022-09-23  8:16 ` Greg Kroah-Hartman
  9 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2022-09-23  6:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
	Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
	Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
	Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
	Ralph Campbell, Stephen Bates

Thanks, the entire series looks good to me now:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Given that this is spread all over, what tree do we want to take it
through?


* Re: [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
  2022-09-22 16:39 ` [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
  2022-09-22 18:27   ` Bjorn Helgaas
@ 2022-09-23  8:15   ` Greg Kroah-Hartman
  1 sibling, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2022-09-23  8:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Thu, Sep 22, 2022 at 10:39:25AM -0600, Logan Gunthorpe wrote:
> Create a sysfs bin attribute called "allocate" under the existing
> "p2pmem" group. The only allowable operation on this file is the mmap()
> call.
> 
> When mmap() is called on this attribute, the kernel allocates a chunk of
> memory from the genalloc and inserts the pages into the VMA. The
> dev_pagemap .page_free callback will indicate when these pages are no
> longer used and they will be put back into the genalloc.
> 
> On device unbind, remove the sysfs file before the memremap_pages are
> cleaned up. This ensures unmap_mapping_range() is called on the file's
> inode and no new mappings can be created.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 124 insertions(+)

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate
  2022-09-22 16:39 ` [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
@ 2022-09-23  8:15   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2022-09-23  8:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Thu, Sep 22, 2022 at 10:39:26AM -0600, Logan Gunthorpe wrote:
> Add documentation for the p2pmem/allocate binary file which allows
> for allocating p2pmem buffers in userspace for passing to drivers
> that support them. (Currently only O_DIRECT to NVMe devices.)
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-pci | 10 ++++++++++
>  1 file changed, 10 insertions(+)

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


* Re: [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices
  2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2022-09-23  6:01 ` [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
@ 2022-09-23  8:16 ` Greg Kroah-Hartman
  9 siblings, 0 replies; 28+ messages in thread
From: Greg Kroah-Hartman @ 2022-09-23  8:16 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Thu, Sep 22, 2022 at 10:39:18AM -0600, Logan Gunthorpe wrote:
> Hi,
> 
> This is the latest P2PDMA userspace patch set. This version includes
> some cleanup from feedback of the last posting[1].
> 
> This patch set enables userspace P2PDMA by allowing userspace to mmap()
> allocated chunks of the CMB. The resulting VMA can be passed only
> to O_DIRECT IO on NVMe backed files or block devices. A flag is added
> to GUP() in Patch 1, then Patches 2 through 6 wire this flag up based
> on whether the block queue indicates P2PDMA support. Patch 7
> creates the sysfs resource that can hand out the VMAs and Patch 8
> adds brief documentation for the new interface.
> 
> Feedback welcome.
> 
> This series is based on v6.0-rc6. A git branch is available here:
> 
>   https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v10

Looks good to me, thanks for sticking with it.

greg k-h


* Re: [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices
  2022-09-23  6:01 ` [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
@ 2022-09-23 15:25   ` Logan Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Greg Kroah-Hartman, Dan Williams, Jason Gunthorpe,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates




On 2022-09-23 00:01, Christoph Hellwig wrote:
> Thanks, the entire series looks good to me now:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Given that this is spread all over, what tree do we want to take it
> through?

Yes, while this is ostensibly a feature for NVMe it turns out we didn't
need to touch any NVMe code at all.

The most likely patch in my mind to have conflicts is the iov_iter patch
as there's been a lot of churn there in the last few cycles and there
are continued discussions.

There are 2 PCI patches, but Bjorn's aware of them and has acked them.
I'm also fairly confident this shouldn't conflict with anything in his tree.

Besides that, there is one mm/gup patch which is the next likely to
conflict; one scatterlist patch and three block layer patches which have
largely been stable when I've done rebases.

Logan


* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-22 16:39 ` [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2022-09-23 18:13   ` Jason Gunthorpe
  2022-09-23 19:08     ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 18:13 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Thu, Sep 22, 2022 at 10:39:19AM -0600, Logan Gunthorpe wrote:
> GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> allow obtaining P2PDMA pages. If GUP is called without the flag and a
> P2PDMA page is found, it will return an error.
> 
> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.

What is causing this? It is really troublesome, I would like to fix
it. eg I would like to have P2PDMA pages in VFIO iommu page tables and
in RDMA MR's - both require longterm.

Is it just because ZONE_DEVICE was created for DAX and carried that
revocable assumption over? Does anything in your series require
revocable?

> @@ -2383,6 +2392,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  
> +		if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
> +			     is_pci_p2pdma_page(page)))
> +			goto pte_unmap;
> +
>  		folio = try_grab_folio(page, 1, flags);
>  		if (!folio)
>  			goto pte_unmap;

On closer look this is not in the right place, we cannot touch the
content of *page without holding a ref, and that doesn't happen until
until try_grab_folio() completes.

It would be simpler to put this check in try_grab_folio/try_grab_page
after the ref has been obtained. That will naturally cover all the
places that need it.

Jason


* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 18:13   ` Jason Gunthorpe
@ 2022-09-23 19:08     ` Logan Gunthorpe
  2022-09-23 19:53       ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 19:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-23 12:13, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 10:39:19AM -0600, Logan Gunthorpe wrote:
>> GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
>> allow obtaining P2PDMA pages. If GUP is called without the flag and a
>> P2PDMA page is found, it will return an error.
>>
>> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
> 
> What is causing this? It is really troublesome, I would like to fix
> it. eg I would like to have P2PDMA pages in VFIO iommu page tables and
> in RDMA MR's - both require longterm.

You had said it was required if we were relying on unmap_mapping_range()...

https://lore.kernel.org/all/20210928200506.GX3544071@ziepe.ca/T/#u

> Is it just because ZONE_DEVICE was created for DAX and carried that
> revocable assumption over? Does anything in your series require
> revocable?

We still rely on unmap_mapping_range() indirectly in the unbind path.
So I expect if something takes a LONGERM mapping that would block until
whatever process holds the pin releases it. That's less than ideal and
I'm not sure what can be done about it.

>> @@ -2383,6 +2392,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>>  		page = pte_page(pte);
>>  
>> +		if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
>> +			     is_pci_p2pdma_page(page)))
>> +			goto pte_unmap;
>> +
>>  		folio = try_grab_folio(page, 1, flags);
>>  		if (!folio)
>>  			goto pte_unmap;
> 
> On closer look this is not in the right place, we cannot touch the
> content of *page without holding a ref, and that doesn't happen until
> until try_grab_folio() completes.
> 
> It would be simpler to put this check in try_grab_folio/try_grab_page
> after the ref has been obtained. That will naturally cover all the
> places that need it.

Ok, I can make that change.

Logan




* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 19:08     ` Logan Gunthorpe
@ 2022-09-23 19:53       ` Jason Gunthorpe
  2022-09-23 20:11         ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 19:53 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2022-09-23 12:13, Jason Gunthorpe wrote:
> > On Thu, Sep 22, 2022 at 10:39:19AM -0600, Logan Gunthorpe wrote:
> >> GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> >> allow obtaining P2PDMA pages. If GUP is called without the flag and a
> >> P2PDMA page is found, it will return an error.
> >>
> >> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
> > 
> > What is causing this? It is really troublesome, I would like to fix
> > it. eg I would like to have P2PDMA pages in VFIO iommu page tables and
> > in RDMA MR's - both require longterm.
> 
> You had said it was required if we were relying on unmap_mapping_range()...

Ah.. Ok.  Dan and I have been talking about this a lot, and it turns
out the DAX approach of unmap_mapping_range() still has problems,
really the same problem as FOLL_LONGTERM:

https://lore.kernel.org/all/Yy2pC%2FupZNEkVmc5@nvidia.com/

ie nothing actually waits for the page refs to go to zero during
memunmap_pages(). (indeed they are not actually zero because currently
they are instantly reset to 1 if they become zero)

The current design requires that the pgmap user hold the pgmap_ref in
a way that it remains elevated until page_free() is called for every
page that was ever used.

I'm encouraging Dan to work on better infrastructure in pgmap core
because every pgmap implementation has this issue currently.

For that reason it is probably not so relevant to this series.

Perhaps just clarify in the commit message that the FOLL_LONGTERM
restriction is to copy DAX until the pgmap page refcounts are fixed.
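
For illustration only (the exact placement and error code here are
assumptions, not necessarily what the posted patch does), that restriction
could be enforced wherever GUP validates its flags, along these lines:

	/* Reject long-term pins of P2PDMA pages until pgmap refcounting is fixed. */
	if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
			 (gup_flags & FOLL_PCI_P2PDMA)))
		return -EOPNOTSUPP;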

> > Is it just because ZONE_DEVICE was created for DAX and carried that
> > revocable assumption over? Does anything in your series require
> > revocable?
> 
> We still rely on unmap_mapping_range() indirectly in the unbind
> path. So I expect that if something takes a LONGTERM mapping, it would
> block until whatever process holds the pin releases it. That's less
> than ideal and I'm not sure what can be done about it.

We could improve the blocking with some kind of FOLL_LONGTERM notifier
thingy: eg after the unmap_mapping_range() broadcast that a range of
PFNs is going away, FOLL_LONGTERM users could do a revoke if they
support it. It is rare enough that we don't necessarily need to optimize
this a lot, and blocking unbind until some FDs close is annoying but not
critical (eg you already can't unmount a filesystem in order to unbind
the underlying nvme device while FS FDs are open)

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 19:53       ` Jason Gunthorpe
@ 2022-09-23 20:11         ` Logan Gunthorpe
  2022-09-23 22:58           ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 20:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-23 13:53, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
> I'm encouraging Dan to work on better infrastructure in pgmap core
> because every pgmap implementation has this issue currently.
> 
> For that reason it is probably not so relevant to this series.
> 
> Perhaps just clarify in the commit message that the FOLL_LONGTERM
> restriction is to copy DAX until the pgmap page refcounts are fixed.

Ok, I'll add that note.

Per the fix for the try_grab_page(), to me it doesn't fit well in 
try_grab_page() without doing a bunch of cleanup to change the
error handling, and the same would have to be added to try_grab_folio().
So I think it's better to leave it where it was, but move it below the 
respective grab calls. Does the incremental patch below look correct?

I am confused about what happens if neither FOLL_PIN nor FOLL_GET
is set (which the documentation for try_grab_x() says is possible, but
other documentation suggests that FOLL_GET is automatically set).
In that case it would be impossible to do the check, since we can't
safely access the page.

I'm assuming that, since there are other accesses to the page at these
two try_grab_x() call sites, those spots will always have FOLL_GET
or FOLL_PIN set and thus this isn't an issue. That's another reason not
to push the check into try_grab_x().

Logan

--

diff --git a/mm/gup.c b/mm/gup.c
index 108848b67f6f..f05ba3e8e29a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -601,12 +601,6 @@ static struct page *follow_page_pte(struct vm_area_struct >
                goto out;
        }
 
-       if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
-                    is_pci_p2pdma_page(page))) {
-               page = ERR_PTR(-EREMOTEIO);
-               goto out;
-       }
-
        VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
                       !PageAnonExclusive(page), page);
 
@@ -615,6 +609,13 @@ static struct page *follow_page_pte(struct vm_area_struct >
                page = ERR_PTR(-ENOMEM);
                goto out;
        }
+
+       if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) {
+               gup_put_folio(page_folio(page), 1, flags);
+               page = ERR_PTR(-EREMOTEIO);
+               goto out;
+       }
+
        /*
         * We need to make the page accessible if and only if we are going
         * to access its content (the FOLL_PIN case).  Please see
@@ -2392,14 +2393,16 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr,>
                VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
                page = pte_page(pte);
 
-               if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
-                            is_pci_p2pdma_page(page)))
-                       goto pte_unmap;
-
                folio = try_grab_folio(page, 1, flags);
                if (!folio)
                        goto pte_unmap;
 
+               if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+                            is_pci_p2pdma_page(page))) {
+                       gup_put_folio(folio, 1, flags);
+                       goto pte_unmap;
+               }
+
                if (unlikely(page_is_secretmem(page))) {
                        gup_put_folio(folio, 1, flags);
                        goto pte_unmap;

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 20:11         ` Logan Gunthorpe
@ 2022-09-23 22:58           ` Jason Gunthorpe
  2022-09-23 23:01             ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 22:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2022-09-23 13:53, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
> > I'm encouraging Dan to work on better infrastructure in pgmap core
> > because every pgmap implementation has this issue currently.
> > 
> > For that reason it is probably not so relevant to this series.
> > 
> > Perhaps just clarify in the commit message that the FOLL_LONGTERM
> > restriction is to copy DAX until the pgmap page refcounts are fixed.
> 
> Ok, I'll add that note.
> 
> Per the fix for the try_grab_page(), to me it doesn't fit well in 
> try_grab_page() without doing a bunch of cleanup to change the
> error handling, and the same would have to be added to try_grab_folio().
> So I think it's better to leave it where it was, but move it below the 
> respective grab calls. Does the incremental patch below look correct?

Oh? I was thinking of just a very simple thing:

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -225,6 +225,11 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
                node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
        }
 
+       if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) {
+               gup_put_folio(page_folio(page), 1, flags);
+              return false;
+       }
+
        return true;
 }


> I am confused about what happens if neither FOLL_PIN nor FOLL_GET
> is set (which the documentation for try_grab_x() says is possible, but
> other documentation suggests that FOLL_GET is automatically set).
> In that case it would be impossible to do the check, since we can't
> safely access the page.

try_grab_page is operating under the PTL so it can probably touch the
page OK (though perhaps we don't need to even check anything)

try_grab_folio cannot be called without PIN/GET, so like this perhaps:

@@ -123,11 +123,14 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+       struct folio *folio;
+
+       if (WARN_ON((flags & (FOLL_GET | FOLL_PIN)) == 0))
+               return NULL;
+
        if (flags & FOLL_GET)
-               return try_get_folio(page, refs);
+               folio = try_get_folio(page, refs);
        else if (flags & FOLL_PIN) {
-               struct folio *folio;
-
                /*
                 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
                 * right zone, so fail and let the caller fall back to the slow
@@ -160,11 +163,14 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
                                        refs * (GUP_PIN_COUNTING_BIAS - 1));
                node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, refs);
 
-               return folio;
        }
 
-       WARN_ON_ONCE(1);
-       return NULL;
+       if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page))) {
+               gup_put_folio(folio, refs, flags);
+               return NULL;
+       }
+
+       return folio;
 }

Jason


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 22:58           ` Jason Gunthorpe
@ 2022-09-23 23:01             ` Logan Gunthorpe
  2022-09-23 23:07               ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 23:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates




On 2022-09-23 16:58, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-09-23 13:53, Jason Gunthorpe wrote:
>>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
>>> I'm encouraging Dan to work on better infrastructure in pgmap core
>>> because every pgmap implementation has this issue currently.
>>>
>>> For that reason it is probably not so relevant to this series.
>>>
>>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
>>> restriction is to copy DAX until the pgmap page refcounts are fixed.
>>
>> Ok, I'll add that note.
>>
>> Per the fix for the try_grab_page(), to me it doesn't fit well in 
>> try_grab_page() without doing a bunch of cleanup to change the
>> error handling, and the same would have to be added to try_grab_folio().
>> So I think it's better to leave it where it was, but move it below the 
>> respective grab calls. Does the incremental patch below look correct?
> 
> Oh? I was thinking of just a very simple thing:

Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
error used for bad P2PDMA page everywhere.

Plus the concern that some of the callsites of try_grab_page() might not have
a get or a pin and thus it's not safe which was the whole point of the change
anyway.

Plus we have to do the same for try_grab_folio().

Logan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:01             ` Logan Gunthorpe
@ 2022-09-23 23:07               ` Jason Gunthorpe
  2022-09-23 23:14                 ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 23:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Fri, Sep 23, 2022 at 05:01:26PM -0600, Logan Gunthorpe wrote:
> 
> 
> 
> On 2022-09-23 16:58, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2022-09-23 13:53, Jason Gunthorpe wrote:
> >>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
> >>> I'm encouraging Dan to work on better infrastructure in pgmap core
> >>> because every pgmap implementation has this issue currently.
> >>>
> >>> For that reason it is probably not so relevant to this series.
> >>>
> >>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
> >>> restriction is to copy DAX until the pgmap page refcounts are fixed.
> >>
> >> Ok, I'll add that note.
> >>
> >> Per the fix for the try_grab_page(), to me it doesn't fit well in 
> >> try_grab_page() without doing a bunch of cleanup to change the
> >> error handling, and the same would have to be added to try_grab_folio().
> >> So I think it's better to leave it where it was, but move it below the 
> >> respective grab calls. Does the incremental patch below look correct?
> > 
> > Oh? I was thinking of just a very simple thing:
> 
> Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
> error used for bad P2PDMA page everywhere.

I'd rather not see GUP made more fragile just for that..

> Plus the concern that some of the callsites of try_grab_page() might not have
> a get or a pin and thus it's not safe which was the whole point of the change
> anyway.

try_grab_page() calls folio_ref_inc(), that is only legal if it knows
the page is already a valid pointer under the PTLs, so it is safe to
check the pgmap as well.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:07               ` Jason Gunthorpe
@ 2022-09-23 23:14                 ` Logan Gunthorpe
  2022-09-23 23:21                   ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 23:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-23 17:07, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:01:26PM -0600, Logan Gunthorpe wrote:
>>
>>
>>
>> On 2022-09-23 16:58, Jason Gunthorpe wrote:
>>> On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>> On 2022-09-23 13:53, Jason Gunthorpe wrote:
>>>>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
>>>>> I'm encouraging Dan to work on better infrastructure in pgmap core
>>>>> because every pgmap implementation has this issue currently.
>>>>>
>>>>> For that reason it is probably not so relevant to this series.
>>>>>
>>>>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
>>>>> restriction is to copy DAX until the pgmap page refcounts are fixed.
>>>>
>>>> Ok, I'll add that note.
>>>>
>>>> Per the fix for the try_grab_page(), to me it doesn't fit well in 
>>>> try_grab_page() without doing a bunch of cleanup to change the
>>>> error handling, and the same would have to be added to try_grab_folio().
>>>> So I think it's better to leave it where it was, but move it below the 
>>>> respective grab calls. Does the incremental patch below look correct?
>>>
>>> Oh? I was thinking of just a very simple thing:
>>
>> Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
>> error used for bad P2PDMA page everywhere.
> 
> I'd rather not see GUP made more fragile just for that..

Not sure how that's more fragile... Your way seems more dangerous given
the large number of call sites we are adding it to when it might not apply.

> 
>> Plus the concern that some of the callsites of try_grab_page() might not have
>> a get or a pin and thus it's not safe which was the whole point of the change
>> anyway.
> 
> try_grab_page() calls folio_ref_inc(), that is only legal if it knows
> the page is already a valid pointer under the PTLs, so it is safe to
> check the pgmap as well.

My point is it doesn't get a reference or a pin unless FOLL_PIN or FOLL_GET is
set and the documentation states that neither might be set, in which case 
folio_ref_inc() will not be called...


Logan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:14                 ` Logan Gunthorpe
@ 2022-09-23 23:21                   ` Jason Gunthorpe
  2022-09-23 23:35                     ` Logan Gunthorpe
  2022-09-23 23:51                     ` Logan Gunthorpe
  0 siblings, 2 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 23:21 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Fri, Sep 23, 2022 at 05:14:11PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2022-09-23 17:07, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 05:01:26PM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >>
> >> On 2022-09-23 16:58, Jason Gunthorpe wrote:
> >>> On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
> >>>>
> >>>>
> >>>> On 2022-09-23 13:53, Jason Gunthorpe wrote:
> >>>>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
> >>>>> I'm encouraging Dan to work on better infrastructure in pgmap core
> >>>>> because every pgmap implementation has this issue currently.
> >>>>>
> >>>>> For that reason it is probably not so relavent to this series.
> >>>>>
> >>>>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
> >>>>> restriction is to copy DAX until the pgmap page refcounts are fixed.
> >>>>
> >>>> Ok, I'll add that note.
> >>>>
> >>>> Per the fix for the try_grab_page(), to me it doesn't fit well in 
> >>>> try_grab_page() without doing a bunch of cleanup to change the
> >>>> error handling, and the same would have to be added to try_grab_folio().
> >>>> So I think it's better to leave it where it was, but move it below the 
> >>>> respective grab calls. Does the incremental patch below look correct?
> >>>
> >>> Oh? I was thinking of just a very simple thing:
> >>
> >> Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
> >> error used for bad P2PDMA page everywhere.
> > 
> > I'd rather not see GUP made more fragile just for that..
> 
> Not sure how that's more fragile... Your way seems more dangerous given
> the large number of call sites we are adding it to when it might not
> apply.

No, that is the point, it *always* applies. A devmap struct page of
the wrong type should never exit gup, from any path, no matter what.

We have two central functions that validate a page is OK to return,
that *everyone* must call.

If you don't put it there then we will probably miss copying it into a
call site eventually.

> > try_grab_page() calls folio_ref_inc(), that is only legal if it knows
> > the page is already a valid pointer under the PTLs, so it is safe to
> > check the pgmap as well.
> 
> My point is it doesn't get a reference or a pin unless FOLL_PIN or FOLL_GET is
> set and the documentation states that neither might be set, in which case 
> folio_ref_inc() will not be called...

That isn't how GUP is structured: all the calls to try_grab_page() are
in places where PIN/GET might be set and are safe for that usage.

If we know PIN/GET is not set then we don't even need to call the
function because it is a NOP.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:21                   ` Jason Gunthorpe
@ 2022-09-23 23:35                     ` Logan Gunthorpe
  2022-09-23 23:51                     ` Logan Gunthorpe
  1 sibling, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 23:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-23 17:21, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:14:11PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-09-23 17:07, Jason Gunthorpe wrote:
>>> On Fri, Sep 23, 2022 at 05:01:26PM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>>
>>>> On 2022-09-23 16:58, Jason Gunthorpe wrote:
>>>>> On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-09-23 13:53, Jason Gunthorpe wrote:
>>>>>>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
>>>>>>> I'm encouraging Dan to work on better infrastructure in pgmap core
>>>>>>> because every pgmap implementation has this issue currently.
>>>>>>>
>>>>>>> For that reason it is probably not so relevant to this series.
>>>>>>>
>>>>>>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
>>>>>>> restriction is to copy DAX until the pgmap page refcounts are fixed.
>>>>>>
>>>>>> Ok, I'll add that note.
>>>>>>
>>>>>> Per the fix for the try_grab_page(), to me it doesn't fit well in 
>>>>>> try_grab_page() without doing a bunch of cleanup to change the
>>>>>> error handling, and the same would have to be added to try_grab_folio().
>>>>>> So I think it's better to leave it where it was, but move it below the 
>>>>>> respective grab calls. Does the incremental patch below look correct?
>>>>>
>>>>> Oh? I was thinking of just a very simple thing:
>>>>
>>>> Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
>>>> error used for bad P2PDMA page everywhere.
>>>
>>> I'd rather not see GUP made more fragile just for that..
>>
>> Not sure how that's more fragile... Your way seems more dangerous given
>> the large number of call sites we are adding it to when it might not
>> apply.
> 
> No, that is the point, it *always* applies. A devmap struct page of
> the wrong type should never exit gup, from any path, no matter what.
> 
> We have two central functions that validate a page is OK to return,
> that *everyone* must call.
> 
> If you don't put it there then we will probably miss copying it into a
> call site eventually.

Most of the call sites don't apply though, with huge pages and gate pages...

>>> try_grab_page() calls folio_ref_inc(), that is only legal if it knows
>>> the page is already a valid pointer under the PTLs, so it is safe to
>>> check the pgmap as well.
>>
>> My point is it doesn't get a reference or a pin unless FOLL_PIN or FOLL_GET is
>> set and the documentation states that neither might be set, in which case 
>> folio_ref_inc() will not be called...
> 
> That isn't how GUP is structured: all the calls to try_grab_page() are
> in places where PIN/GET might be set and are safe for that usage.
> 
> If we know PIN/GET is not set then we don't even need to call the
> function because it is a NOP.

That's not what the documentation for the function says:

"Either FOLL_PIN or FOLL_GET (or neither) may be set... Return: true for success, 
 or if no action was required (if neither FOLL_PIN nor FOLL_GET was set, nothing 
 is done)."

https://elixir.bootlin.com/linux/v6.0-rc6/source/mm/gup.c#L194

Logan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:21                   ` Jason Gunthorpe
  2022-09-23 23:35                     ` Logan Gunthorpe
@ 2022-09-23 23:51                     ` Logan Gunthorpe
  2022-09-26 22:57                       ` Jason Gunthorpe
  1 sibling, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-23 23:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-23 17:21, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:14:11PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-09-23 17:07, Jason Gunthorpe wrote:
>>> On Fri, Sep 23, 2022 at 05:01:26PM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>>
>>>> On 2022-09-23 16:58, Jason Gunthorpe wrote:
>>>>> On Fri, Sep 23, 2022 at 02:11:03PM -0600, Logan Gunthorpe wrote:
>>>>>>
>>>>>>
>>>>>> On 2022-09-23 13:53, Jason Gunthorpe wrote:
>>>>>>> On Fri, Sep 23, 2022 at 01:08:31PM -0600, Logan Gunthorpe wrote:
>>>>>>> I'm encouraging Dan to work on better infrastructure in pgmap core
>>>>>>> because every pgmap implementation has this issue currently.
>>>>>>>
>>>>>>> For that reason it is probably not so relevant to this series.
>>>>>>>
>>>>>>> Perhaps just clarify in the commit message that the FOLL_LONGTERM
>>>>>>> restriction is to copy DAX until the pgmap page refcounts are fixed.
>>>>>>
>>>>>> Ok, I'll add that note.
>>>>>>
>>>>>> Per the fix for the try_grab_page(), to me it doesn't fit well in 
>>>>>> try_grab_page() without doing a bunch of cleanup to change the
>>>>>> error handling, and the same would have to be added to try_grab_folio().
>>>>>> So I think it's better to leave it where it was, but move it below the 
>>>>>> respective grab calls. Does the incremental patch below look correct?
>>>>>
>>>>> Oh? I was thinking of just a very simple thing:
>>>>
>>>> Really would like it to return -EREMOTEIO instead of -ENOMEM as that's the
>>>> error used for bad P2PDMA page everywhere.
>>>
>>> I'd rather not see GUP made more fragile just for that..

And on further consideration I really think the correct error return is
important here. This will be a user-facing error that'll be easy enough
to hit: think of code that might be run on any file; if the file is
hosted on a block device that doesn't support P2PDMA, the user
will see the very uninformative "Cannot allocate memory" error.

Userspace code that's written for this purpose can look at the EREMOTEIO
error and tell the user something useful, if we return the correct error.
If we return ENOMEM in this case, that is not possible because
lots of things might have caused that error.
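
For example (purely illustrative; the helper and names below are
hypothetical), userspace written for this purpose could do something like:

	#include <errno.h>
	#include <stdio.h>
	#include <unistd.h>

	/* fd: file opened with O_DIRECT on an NVMe-backed file;
	 * p2p_buf: buffer mmap()ed from the p2pmem allocate sysfs file. */
	static int read_into_p2p_buf(int fd, void *p2p_buf, size_t len)
	{
		ssize_t ret = pread(fd, p2p_buf, len, 0);

		if (ret >= 0)
			return 0;
		if (errno == EREMOTEIO)
			fprintf(stderr, "this I/O path does not support P2PDMA buffers\n");
		else
			perror("pread");	/* ENOMEM etc. could mean anything */
		return -1;
	}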

Logan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-23 23:51                     ` Logan Gunthorpe
@ 2022-09-26 22:57                       ` Jason Gunthorpe
  2022-09-28 21:38                         ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2022-09-26 22:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates

On Fri, Sep 23, 2022 at 05:51:49PM -0600, Logan Gunthorpe wrote:

> And on further consideration I really think the correct error return is
> important here. This will be a user-facing error that'll be easy enough
> to hit: think of code that might be run on any file; if the file is
> hosted on a block device that doesn't support P2PDMA, the user
> will see the very uninformative "Cannot allocate memory" error.
> 
> Userspace code that's written for this purpose can look at the EREMOTEIO
> error and tell the user something useful, if we return the correct error.
> If we return ENOMEM in this case, that is not possible because
> lots of things might have caused that error.

That is reasonable, but I'd still prefer to see it done more
centrally.

>> If we know PIN/GET is not set then we don't even need to call the
>> function because it is a NOP.

> That's not what the documentation for the function says:

> "Either FOLL_PIN or FOLL_GET (or neither) may be set... Return: true for success,
>  or if no action was required (if neither FOLL_PIN nor FOLL_GET was set, nothing
>  is done)."

I mean that, the way the code is structured, the PIN/GET/0 decision is
made at the top of the call chain and then the call chain is run. All the
call sites of try_grab_page() must be safe to call under FOLL_PIN
because their caller is making the decision about what flag to use.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  2022-09-26 22:57                       ` Jason Gunthorpe
@ 2022-09-28 21:38                         ` Logan Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-28 21:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
	Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
	Christian König, John Hubbard, Don Dutile, Matthew Wilcox,
	Daniel Vetter, Minturn Dave B, Jason Ekstrand, Dave Hansen,
	Xiong Jianxin, Bjorn Helgaas, Ira Weiny, Robin Murphy,
	Martin Oliveira, Chaitanya Kulkarni, Ralph Campbell,
	Stephen Bates



On 2022-09-26 16:57, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 05:51:49PM -0600, Logan Gunthorpe wrote:
>> Userspace code that's written for purpose can look at the EREMOTEIO error
>> and tell the user something useful, if we return the correct error.
>> If we return ENOMEM in this case, that is not possible because
>> lots of things might have caused that error.
> 
> That is reasonable, but I'd still prefer to see it done more
> centrally.
> 
> I mean that, the way the code is structured, the PIN/GET/0 decision is
> made at the top of the call chain and then the call chain is run. All the
> call sites of try_grab_page() must be safe to call under FOLL_PIN
> because their caller is making the decision about what flag to use.

Ok, so I've done some auditing here.

I've convinced myself it's safe to access the page before incrementing
the reference:

 * In the try_grab_page() case it must be safe, as all call sites do seem
to be called under the appropriate ptl or mmap_lock (though this is hard
to audit). It also already touches the page struct anyway, in order to
take the reference.
 * In the try_grab_folio() case there is already a similar
FOLL_LONGTERM check in that function *before* getting the reference, and
the page should be stable due to the existing gup fast guarantees.

So we don't need to do the check after we have the reference and release
it when it fails. This simplifies things.

Moving the check into try_grab_x() should be possible with some cleanup.

For try_grab_page(), there are a few call sites that WARN_ON if it
fails, assuming it cannot fail since the page is stable.
try_grab_page() already has a WARN_ON on failure, so it appears fine to
remove the second WARN_ON and add a new failure path that doesn't WARN.

For try_grab_folio() there's one call site in follow_hugetlb_page() that
assumes success and warns on failure; but this call site only applies to
hugetlb pages, which should never be P2PDMA pages (nor non-longterm pages,
which is another existing failure path). So I've added a note in the
comment about a couple of other conditions that should not be possible.
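
For illustration, one possible shape of that cleanup (a sketch only, not
necessarily what v11 will look like) is to have try_grab_page() return an
errno and do the check before taking the reference, so callers can
propagate -EREMOTEIO:

	int __must_check try_grab_page(struct page *page, unsigned int flags)
	{
		struct folio *folio = page_folio(page);

		if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
			return -ENOMEM;

		/* Stable under the PTL, so the pgmap can be inspected before the grab. */
		if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
			return -EREMOTEIO;

		if (flags & FOLL_GET)
			folio_ref_inc(folio);
		else if (flags & FOLL_PIN) {
			/* ... existing FOLL_PIN accounting, unchanged ... */
		}

		return 0;
	}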

I expect this work is way too late for the merge window now so I'll send
v11 after the window. In the meantime, if you want to do a quick review
on the first two patches, it would speed things up if there are obvious
changes. You can see these patches on this git branch:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v11pre

Thanks,

Logan




^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-09-28 21:38 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 16:39 [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
2022-09-23 18:13   ` Jason Gunthorpe
2022-09-23 19:08     ` Logan Gunthorpe
2022-09-23 19:53       ` Jason Gunthorpe
2022-09-23 20:11         ` Logan Gunthorpe
2022-09-23 22:58           ` Jason Gunthorpe
2022-09-23 23:01             ` Logan Gunthorpe
2022-09-23 23:07               ` Jason Gunthorpe
2022-09-23 23:14                 ` Logan Gunthorpe
2022-09-23 23:21                   ` Jason Gunthorpe
2022-09-23 23:35                     ` Logan Gunthorpe
2022-09-23 23:51                     ` Logan Gunthorpe
2022-09-26 22:57                       ` Jason Gunthorpe
2022-09-28 21:38                         ` Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 3/8] block: add check when merging zone device pages Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 4/8] lib/scatterlist: " Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
2022-09-22 16:39 ` [PATCH v10 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
2022-09-22 18:27   ` Bjorn Helgaas
2022-09-23  8:15   ` Greg Kroah-Hartman
2022-09-22 16:39 ` [PATCH v10 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
2022-09-23  8:15   ` Greg Kroah-Hartman
2022-09-23  6:01 ` [PATCH v10 0/8] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
2022-09-23 15:25   ` Logan Gunthorpe
2022-09-23  8:16 ` Greg Kroah-Hartman
