* [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page()
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-24 15:00 ` Christoph Hellwig
` (2 more replies)
2022-10-21 17:41 ` [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
` (9 subsequent siblings)
10 siblings, 3 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
In order to add checks for P2PDMA memory into try_grab_page(), expand
the error return from a bool to an int/error code. Update all the
call sites to handle the change in usage.
Also remove the WARN_ON_ONCE() calls at the call sites, since there
is already a WARN_ON_ONCE() inside the function if it fails.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
include/linux/mm.h | 2 +-
mm/gup.c | 26 ++++++++++++++------------
mm/huge_memory.c | 19 +++++++++++++------
mm/hugetlb.c | 17 +++++++++--------
4 files changed, 37 insertions(+), 27 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..62a91dc1272b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1129,7 +1129,7 @@ static inline void get_page(struct page *page)
folio_get(page_folio(page));
}
-bool __must_check try_grab_page(struct page *page, unsigned int flags);
+int __must_check try_grab_page(struct page *page, unsigned int flags);
static inline __must_check bool try_get_page(struct page *page)
{
diff --git a/mm/gup.c b/mm/gup.c
index fe195d47de74..e2f447446384 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -202,17 +202,19 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
* time. Cases: please see the try_grab_folio() documentation, with
* "refs=1".
*
- * Return: true for success, or if no action was required (if neither FOLL_PIN
- * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
- * FOLL_PIN was set, but the page could not be grabbed.
+ * Return: 0 for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). A negative error code for failure:
+ *
+ * -ENOMEM FOLL_GET or FOLL_PIN was set, but the page could not
+ * be grabbed.
*/
-bool __must_check try_grab_page(struct page *page, unsigned int flags)
+int __must_check try_grab_page(struct page *page, unsigned int flags)
{
struct folio *folio = page_folio(page);
WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
- return false;
+ return -ENOMEM;
if (flags & FOLL_GET)
folio_ref_inc(folio);
@@ -232,7 +234,7 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
}
- return true;
+ return 0;
}
/**
@@ -624,8 +626,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
!PageAnonExclusive(page), page);
/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
- if (unlikely(!try_grab_page(page, flags))) {
- page = ERR_PTR(-ENOMEM);
+ ret = try_grab_page(page, flags);
+ if (unlikely(ret)) {
+ page = ERR_PTR(ret);
goto out;
}
/*
@@ -960,10 +963,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
goto unmap;
*page = pte_page(*pte);
}
- if (unlikely(!try_grab_page(*page, gup_flags))) {
- ret = -ENOMEM;
+ ret = try_grab_page(*page, gup_flags);
+ if (unlikely(ret))
goto unmap;
- }
out:
ret = 0;
unmap:
@@ -2536,7 +2538,7 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
}
SetPageReferenced(page);
pages[*nr] = page;
- if (unlikely(!try_grab_page(page, flags))) {
+ if (unlikely(try_grab_page(page, flags))) {
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..52f2b2a2ffae 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1035,6 +1035,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn = pmd_pfn(*pmd);
struct mm_struct *mm = vma->vm_mm;
struct page *page;
+ int ret;
assert_spin_locked(pmd_lockptr(mm, pmd));
@@ -1066,8 +1067,9 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
if (!*pgmap)
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
- if (!try_grab_page(page, flags))
- page = ERR_PTR(-ENOMEM);
+ ret = try_grab_page(page, flags);
+ if (ret)
+ page = ERR_PTR(ret);
return page;
}
@@ -1193,6 +1195,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn = pud_pfn(*pud);
struct mm_struct *mm = vma->vm_mm;
struct page *page;
+ int ret;
assert_spin_locked(pud_lockptr(mm, pud));
@@ -1226,8 +1229,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
if (!*pgmap)
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
- if (!try_grab_page(page, flags))
- page = ERR_PTR(-ENOMEM);
+
+ ret = try_grab_page(page, flags);
+ if (ret)
+ page = ERR_PTR(ret);
return page;
}
@@ -1435,6 +1440,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
struct page *page;
+ int ret;
assert_spin_locked(pmd_lockptr(mm, pmd));
@@ -1459,8 +1465,9 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
!PageAnonExclusive(page), page);
- if (!try_grab_page(page, flags))
- return ERR_PTR(-ENOMEM);
+ ret = try_grab_page(page, flags);
+ if (ret)
+ return ERR_PTR(ret);
if (flags & FOLL_TOUCH)
touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b586cdd75930..e8d01a19ce46 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7224,14 +7224,15 @@ follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address, int flags
page = pte_page(pte) +
((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
/*
- * try_grab_page() should always succeed here, because: a) we
- * hold the pmd (ptl) lock, and b) we've just checked that the
- * huge pmd (head) page is present in the page tables. The ptl
- * prevents the head page and tail pages from being rearranged
- * in any way. So this page must be available at this point,
- * unless the page refcount overflowed:
+ * try_grab_page() should always be able to get the page here,
+ * because: a) we hold the pmd (ptl) lock, and b) we've just
+ * checked that the huge pmd (head) page is present in the
+ * page tables. The ptl prevents the head page and tail pages
+ * from being rearranged in any way. So this page must be
+ * available at this point, unless the page refcount
+ * overflowed:
*/
- if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+ if (try_grab_page(page, flags)) {
page = NULL;
goto out;
}
@@ -7269,7 +7270,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
pte = huge_ptep_get((pte_t *)pud);
if (pte_present(pte)) {
page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
- if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+ if (try_grab_page(page, flags)) {
page = NULL;
goto out;
}
--
2.30.2
* Re: [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page()
2022-10-21 17:41 ` [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page() Logan Gunthorpe
@ 2022-10-24 15:00 ` Christoph Hellwig
2022-10-24 16:37 ` Dan Williams
2022-10-25 1:06 ` Chaitanya Kulkarni
2 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-10-24 15:00 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On Fri, Oct 21, 2022 at 11:41:08AM -0600, Logan Gunthorpe wrote:
> In order to add checks for P2PDMA memory into try_grab_page(), expand
> the error return from a bool to an int/error code. Update all the
> call sites to handle the change in usage.
>
> Also remove the WARN_ON_ONCE() calls at the call sites, since there
> is already a WARN_ON_ONCE() inside the function if it fails.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* RE: [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page()
2022-10-21 17:41 ` [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page() Logan Gunthorpe
2022-10-24 15:00 ` Christoph Hellwig
@ 2022-10-24 16:37 ` Dan Williams
2022-10-25 1:06 ` Chaitanya Kulkarni
2 siblings, 0 replies; 35+ messages in thread
From: Dan Williams @ 2022-10-24 16:37 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
Logan Gunthorpe wrote:
> In order to add checks for P2PDMA memory into try_grab_page(), expand
> the error return from a bool to an int/error code. Update all the
> call sites to handle the change in usage.
>
> Also remove the WARN_ON_ONCE() calls at the call sites, since there
> is already a WARN_ON_ONCE() inside the function if it fails.
Looks good,
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page()
2022-10-21 17:41 ` [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page() Logan Gunthorpe
2022-10-24 15:00 ` Christoph Hellwig
2022-10-24 16:37 ` Dan Williams
@ 2022-10-25 1:06 ` Chaitanya Kulkarni
2 siblings, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:06 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> In order to add checks for P2PDMA memory into try_grab_page(), expand
> the error return from a bool to an int/error code. Update all the
> call sites to handle the change in usage.
>
> Also remove the WARN_ON_ONCE() calls at the call sites, since there
> is already a WARN_ON_ONCE() inside the function if it fails.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
2022-10-21 17:41 ` [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page() Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-24 15:00 ` Christoph Hellwig
2022-10-25 1:09 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
` (8 subsequent siblings)
10 siblings, 2 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
GUP callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error in try_grab_page() or
try_grab_folio().
The check is safe to do before taking the reference to the page in both
cases, because the page should be protected by either the appropriate
ptl or mmap_lock, or by the GUP-fast guarantees that prevent TLB flushes.
try_grab_folio() has one call site that WARNs on failure and cannot
actually deal with the failure of this function (it seems it would
get into an infinite loop). Expand the comment there to document a
couple more reasons why it will not fail.
FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set. This copies the
fsdax behaviour until pgmap refcounts are fixed (see the link below for
more information).
Link: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
include/linux/mm.h | 1 +
mm/gup.c | 19 ++++++++++++++++++-
mm/hugetlb.c | 6 ++++--
3 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 62a91dc1272b..6b081a8dcf88 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2958,6 +2958,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index e2f447446384..29e28f020f0b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,9 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
*/
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
{
+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+ return NULL;
+
if (flags & FOLL_GET)
return try_get_folio(page, refs);
else if (flags & FOLL_PIN) {
@@ -216,6 +219,9 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
if (WARN_ON_ONCE(folio_ref_count(folio) <= 0))
return -ENOMEM;
+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
+ return -EREMOTEIO;
+
if (flags & FOLL_GET)
folio_ref_inc(folio);
else if (flags & FOLL_PIN) {
@@ -631,6 +637,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(ret);
goto out;
}
+
/*
* We need to make the page accessible if and only if we are going
* to access its content (the FOLL_PIN case). Please see
@@ -1060,6 +1067,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
+ if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+ return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
@@ -2536,6 +2546,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+ if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+ undo_dev_pagemap(nr, nr_start, flags, pages);
+ break;
+ }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(try_grab_page(page, flags))) {
@@ -3020,7 +3036,8 @@ static int internal_get_user_pages_fast(unsigned long start,
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
- FOLL_FAST_ONLY | FOLL_NOFAULT)))
+ FOLL_FAST_ONLY | FOLL_NOFAULT |
+ FOLL_PCI_P2PDMA)))
return -EINVAL;
if (gup_flags & FOLL_PIN)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8d01a19ce46..a55adfbacedb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6342,8 +6342,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* tables. If the huge page is present, then the tail
* pages must also be present. The ptl prevents the
* head page and tail pages from being rearranged in
- * any way. So this page must be available at this
- * point, unless the page refcount overflowed:
+ * any way. As this is hugetlb, the pages will never
+ * be p2pdma or not longterm pinnable. So this page
+ * must be available at this point, unless the page
+ * refcount overflowed:
*/
if (WARN_ON_ONCE(!try_grab_folio(pages[i], refs,
flags))) {
--
2.30.2
* Re: [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
2022-10-21 17:41 ` [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2022-10-24 15:00 ` Christoph Hellwig
2022-10-25 1:09 ` Chaitanya Kulkarni
1 sibling, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-10-24 15:00 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
2022-10-21 17:41 ` [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
2022-10-24 15:00 ` Christoph Hellwig
@ 2022-10-25 1:09 ` Chaitanya Kulkarni
1 sibling, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:09 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> GUP callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> allow obtaining P2PDMA pages. If GUP is called without the flag and a
> P2PDMA page is found, it will return an error in try_grab_page() or
> try_grab_folio().
>
> The check is safe to do before taking the reference to the page in both
> cases, because the page should be protected by either the appropriate
> ptl or mmap_lock, or by the GUP-fast guarantees that prevent TLB flushes.
>
> try_grab_folio() has one call site that WARNs on failure and cannot
> actually deal with the failure of this function (it seems it would
> get into an infinite loop). Expand the comment there to document a
> couple more reasons why it will not fail.
>
> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set. This copies the
> fsdax behaviour until pgmap refcounts are fixed (see the link below for
> more information).
>
> Link: https://lkml.kernel.org/r/Yy4Ot5MoOhsgYLTQ@ziepe.ca
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
2022-10-21 17:41 ` [PATCH v11 1/9] mm: allow multiple error returns in try_grab_page() Logan Gunthorpe
2022-10-21 17:41 ` [PATCH v11 2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:14 ` Chaitanya Kulkarni
2022-10-27 7:11 ` Jay Fang
2022-10-21 17:41 ` [PATCH v11 4/9] block: add check when merging zone device pages Logan Gunthorpe
` (7 subsequent siblings)
10 siblings, 2 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().
This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
include/linux/uio.h | 6 ++++++
lib/iov_iter.c | 32 ++++++++++++++++++++++++--------
2 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e3134b14ffd..9ede533ce64c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
loff_t start, size_t count);
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start,
+ unsigned gup_flags);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start,
+ unsigned gup_flags);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c3ca28ca68a6..53efad017f3c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1430,7 +1430,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- unsigned int maxpages, size_t *start)
+ unsigned int maxpages, size_t *start,
+ unsigned int gup_flags)
{
unsigned int n;
@@ -1442,7 +1443,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
maxsize = MAX_RW_COUNT;
if (likely(user_backed_iter(i))) {
- unsigned int gup_flags = 0;
unsigned long addr;
int res;
@@ -1492,33 +1492,49 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
return -EFAULT;
}
-ssize_t iov_iter_get_pages2(struct iov_iter *i,
+ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
+ size_t *start, unsigned gup_flags)
{
if (!maxpages)
return 0;
BUG_ON(!pages);
- return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
+ return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
+ start, gup_flags);
+}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages);
+
+ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start)
+{
+ return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
}
EXPORT_SYMBOL(iov_iter_get_pages2);
-ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start)
+ size_t *start, unsigned gup_flags)
{
ssize_t len;
*pages = NULL;
- len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
+ len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+ gup_flags);
if (len <= 0) {
kvfree(*pages);
*pages = NULL;
}
return len;
}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
+
+ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start)
+{
+ return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
+}
EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
--
2.30.2
* Re: [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-21 17:41 ` [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
@ 2022-10-25 1:14 ` Chaitanya Kulkarni
2022-10-25 15:35 ` Logan Gunthorpe
2022-10-27 7:11 ` Jay Fang
1 sibling, 1 reply; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:14 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
> which take a flags argument that is passed to get_user_pages_fast().
>
> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> include/linux/uio.h | 6 ++++++
> lib/iov_iter.c | 32 ++++++++++++++++++++++++--------
> 2 files changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 2e3134b14ffd..9ede533ce64c 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
> void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
> loff_t start, size_t count);
> +ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start,
> + unsigned gup_flags);
> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> size_t maxsize, unsigned maxpages, size_t *start);
> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> + struct page ***pages, size_t maxsize, size_t *start,
> + unsigned gup_flags);
> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> size_t maxsize, size_t *start);
> int iov_iter_npages(const struct iov_iter *i, int maxpages);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index c3ca28ca68a6..53efad017f3c 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1430,7 +1430,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>
> static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> - unsigned int maxpages, size_t *start)
> + unsigned int maxpages, size_t *start,
> + unsigned int gup_flags)
> {
> unsigned int n;
>
> @@ -1442,7 +1443,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> maxsize = MAX_RW_COUNT;
>
> if (likely(user_backed_iter(i))) {
> - unsigned int gup_flags = 0;
> unsigned long addr;
> int res;
>
> @@ -1492,33 +1492,49 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> return -EFAULT;
> }
>
> -ssize_t iov_iter_get_pages2(struct iov_iter *i,
> +ssize_t iov_iter_get_pages(struct iov_iter *i,
> struct page **pages, size_t maxsize, unsigned maxpages,
> - size_t *start)
> + size_t *start, unsigned gup_flags)
> {
> if (!maxpages)
> return 0;
> BUG_ON(!pages);
>
> - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
> + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
> + start, gup_flags);
> +}
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages);
> +
> +ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start)
> +{
> + return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
> }
> EXPORT_SYMBOL(iov_iter_get_pages2);
>
> -ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> - size_t *start)
> + size_t *start, unsigned gup_flags)
> {
> ssize_t len;
>
> *pages = NULL;
>
> - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
> + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
> + gup_flags);
> if (len <= 0) {
> kvfree(*pages);
> *pages = NULL;
> }
> return len;
> }
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
> +
> +ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> + struct page ***pages, size_t maxsize, size_t *start)
> +{
> + return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
> +}
> EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
Just one minor question: why not make the following functions
EXPORT_SYMBOL_GPL()?
1. iov_iter_get_pages2()
2. iov_iter_get_pages_alloc2()
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-25 1:14 ` Chaitanya Kulkarni
@ 2022-10-25 15:35 ` Logan Gunthorpe
2022-10-25 15:41 ` Christoph Hellwig
0 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-25 15:35 UTC (permalink / raw)
To: Chaitanya Kulkarni, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 2022-10-24 19:14, Chaitanya Kulkarni wrote:
> On 10/21/22 10:41, Logan Gunthorpe wrote:
>> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
>> which take a flags argument that is passed to get_user_pages_fast().
>>
>> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>> include/linux/uio.h | 6 ++++++
>> lib/iov_iter.c | 32 ++++++++++++++++++++++++--------
>> 2 files changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/uio.h b/include/linux/uio.h
>> index 2e3134b14ffd..9ede533ce64c 100644
>> --- a/include/linux/uio.h
>> +++ b/include/linux/uio.h
>> @@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
>> void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
>> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
>> loff_t start, size_t count);
>> +ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
>> + size_t maxsize, unsigned maxpages, size_t *start,
>> + unsigned gup_flags);
>> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
>> size_t maxsize, unsigned maxpages, size_t *start);
>> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>> + struct page ***pages, size_t maxsize, size_t *start,
>> + unsigned gup_flags);
>> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
>> size_t maxsize, size_t *start);
>> int iov_iter_npages(const struct iov_iter *i, int maxpages);
>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>> index c3ca28ca68a6..53efad017f3c 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -1430,7 +1430,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>>
>> static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>> struct page ***pages, size_t maxsize,
>> - unsigned int maxpages, size_t *start)
>> + unsigned int maxpages, size_t *start,
>> + unsigned int gup_flags)
>> {
>> unsigned int n;
>>
>> @@ -1442,7 +1443,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>> maxsize = MAX_RW_COUNT;
>>
>> if (likely(user_backed_iter(i))) {
>> - unsigned int gup_flags = 0;
>> unsigned long addr;
>> int res;
>>
>> @@ -1492,33 +1492,49 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>> return -EFAULT;
>> }
>>
>> -ssize_t iov_iter_get_pages2(struct iov_iter *i,
>> +ssize_t iov_iter_get_pages(struct iov_iter *i,
>> struct page **pages, size_t maxsize, unsigned maxpages,
>> - size_t *start)
>> + size_t *start, unsigned gup_flags)
>> {
>> if (!maxpages)
>> return 0;
>> BUG_ON(!pages);
>>
>> - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
>> + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
>> + start, gup_flags);
>> +}
>> +EXPORT_SYMBOL_GPL(iov_iter_get_pages);
>> +
>> +ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
>> + size_t maxsize, unsigned maxpages, size_t *start)
>> +{
>> + return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
>> }
>> EXPORT_SYMBOL(iov_iter_get_pages2);
>>
>> -ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>> struct page ***pages, size_t maxsize,
>> - size_t *start)
>> + size_t *start, unsigned gup_flags)
>> {
>> ssize_t len;
>>
>> *pages = NULL;
>>
>> - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
>> + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
>> + gup_flags);
>> if (len <= 0) {
>> kvfree(*pages);
>> *pages = NULL;
>> }
>> return len;
>> }
>> +EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
>> +
>> +ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>> + struct page ***pages, size_t maxsize, size_t *start)
>> +{
>> + return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
>> +}
>> EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
> Just one minor question why not make following functions
> EXPORT_SYMBOL_GPL() ?
>
> 1. iov_iter_get_pages2()
> 2. iov_iter_get_pages_alloc2()
They previously were not GPL, so I didn't think that should be changed
in this patch.
Thanks for the review!
Logan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-25 15:35 ` Logan Gunthorpe
@ 2022-10-25 15:41 ` Christoph Hellwig
0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-10-25 15:41 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Chaitanya Kulkarni, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm, Christoph Hellwig, Greg Kroah-Hartman,
Dan Williams, Jason Gunthorpe, Christian König,
John Hubbard, Don Dutile, Matthew Wilcox, Daniel Vetter,
Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
Chaitanya Kulkarni, Ralph Campbell, Stephen Bates
> > Just one minor question why not make following functions
> > EXPORT_SYMBOL_GPL() ?
> >
> > 1. iov_iter_get_pages2()
> > 2. iov_iter_get_pages_alloc2()
>
> They previously were not GPL, so I didn't think that should be changed
> in this patch.
Yes. While they should have been _GPL from the start, rocking that
boat is a bit pointless now. We just need to make sure to do the
right thing for the pinning variants that are going to replace them soon.
^ permalink raw reply [flat|nested] 35+ messages in thread
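[Editorial note] The pattern this patch applies — move the real work into a variant that takes an explicit gup_flags argument, and keep the old entry point as a thin wrapper passing 0 — can be sketched in plain, runnable C. This is a userspace model with made-up names and a toy return value, not the kernel signatures:

```c
#include <assert.h>

/*
 * Userspace sketch of the compat-wrapper pattern from this patch:
 * the flags-taking variant does the work, and the pre-existing entry
 * point survives unchanged by forwarding gup_flags = 0. The names
 * and the toy return encoding are illustrative only.
 */
static long get_pages_flags(long maxsize, unsigned int gup_flags)
{
	/* Toy body: encode both inputs so the wrapper is observable. */
	return maxsize * 16 + (long)gup_flags;
}

/* Legacy entry point, kept source- and behavior-compatible. */
static long get_pages2(long maxsize)
{
	return get_pages_flags(maxsize, 0);
}
```

Only callers that need a flag such as FOLL_PCI_P2PDMA move to the flags-taking variant; everyone else keeps calling the old name, which is also why the existing EXPORT_SYMBOL() markings on the old functions can stay untouched.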
* Re: [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-21 17:41 ` [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
2022-10-25 1:14 ` Chaitanya Kulkarni
@ 2022-10-27 7:11 ` Jay Fang
2022-10-27 14:22 ` Logan Gunthorpe
1 sibling, 1 reply; 35+ messages in thread
From: Jay Fang @ 2022-10-27 7:11 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 2022/10/22 1:41, Logan Gunthorpe wrote:
> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
> which take a flags argument that is passed to get_user_pages_fast().
>
> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> include/linux/uio.h | 6 ++++++
> lib/iov_iter.c | 32 ++++++++++++++++++++++++--------
> 2 files changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 2e3134b14ffd..9ede533ce64c 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
> void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
> loff_t start, size_t count);
> +ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start,
> + unsigned gup_flags);
> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> size_t maxsize, unsigned maxpages, size_t *start);
> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> + struct page ***pages, size_t maxsize, size_t *start,
> + unsigned gup_flags);
> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> size_t maxsize, size_t *start);
> int iov_iter_npages(const struct iov_iter *i, int maxpages);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index c3ca28ca68a6..53efad017f3c 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1430,7 +1430,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>
> static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> - unsigned int maxpages, size_t *start)
> + unsigned int maxpages, size_t *start,
> + unsigned int gup_flags)
Hi,
found some checkpatch warnings, like this:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#50: FILE: lib/iov_iter.c:1497:
+ size_t *start, unsigned gup_flags)
> {
> unsigned int n;
>
> @@ -1442,7 +1443,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> maxsize = MAX_RW_COUNT;
>
> if (likely(user_backed_iter(i))) {
> - unsigned int gup_flags = 0;
> unsigned long addr;
> int res;
>
> @@ -1492,33 +1492,49 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> return -EFAULT;
> }
>
> -ssize_t iov_iter_get_pages2(struct iov_iter *i,
> +ssize_t iov_iter_get_pages(struct iov_iter *i,
> struct page **pages, size_t maxsize, unsigned maxpages,
> - size_t *start)
> + size_t *start, unsigned gup_flags)
> {
> if (!maxpages)
> return 0;
> BUG_ON(!pages);
>
> - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
> + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
> + start, gup_flags);
> +}
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages);
> +
> +ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start)
> +{
> + return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
> }
> EXPORT_SYMBOL(iov_iter_get_pages2);
>
> -ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> - size_t *start)
> + size_t *start, unsigned gup_flags)
> {
> ssize_t len;
>
> *pages = NULL;
>
> - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
> + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
> + gup_flags);
> if (len <= 0) {
> kvfree(*pages);
> *pages = NULL;
> }
> return len;
> }
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
> +
> +ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> + struct page ***pages, size_t maxsize, size_t *start)
> +{
> + return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
> +}
> EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
>
> size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
2022-10-27 7:11 ` Jay Fang
@ 2022-10-27 14:22 ` Logan Gunthorpe
0 siblings, 0 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-27 14:22 UTC (permalink / raw)
To: Jay Fang, linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 2022-10-27 01:11, Jay Fang wrote:
> On 2022/10/22 1:41, Logan Gunthorpe wrote:
>> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
>> which take a flags argument that is passed to get_user_pages_fast().
>>
>> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>> include/linux/uio.h | 6 ++++++
>> lib/iov_iter.c | 32 ++++++++++++++++++++++++--------
>> 2 files changed, 30 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/uio.h b/include/linux/uio.h
>> index 2e3134b14ffd..9ede533ce64c 100644
>> --- a/include/linux/uio.h
>> +++ b/include/linux/uio.h
>> @@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
>> void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
>> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
>> loff_t start, size_t count);
>> +ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
>> + size_t maxsize, unsigned maxpages, size_t *start,
>> + unsigned gup_flags);
>> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
>> size_t maxsize, unsigned maxpages, size_t *start);
>> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>> + struct page ***pages, size_t maxsize, size_t *start,
>> + unsigned gup_flags);
>> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
>> size_t maxsize, size_t *start);
>> int iov_iter_npages(const struct iov_iter *i, int maxpages);
>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>> index c3ca28ca68a6..53efad017f3c 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -1430,7 +1430,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>>
>> static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>> struct page ***pages, size_t maxsize,
>> - unsigned int maxpages, size_t *start)
>> + unsigned int maxpages, size_t *start,
>> + unsigned int gup_flags)
>
> Hi,
> found some checkpatch warnings, like this:
> WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
> #50: FILE: lib/iov_iter.c:1497:
> + size_t *start, unsigned gup_flags)
We usually stick with the choices of the nearby code instead of
the warnings of checkpatch.
Thanks,
Logan
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH v11 4/9] block: add check when merging zone device pages
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (2 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags() Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:16 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 5/9] lib/scatterlist: " Logan Gunthorpe
` (6 subsequent siblings)
10 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The helper returns true if either
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.
Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
block/bio.c | 2 ++
include/linux/mmzone.h | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 633a902468ec..439469370b7c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -863,6 +863,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+ if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return false;
*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f74891556f3..9c49ec5d0e25 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -986,6 +986,25 @@ static inline bool is_zone_device_page(const struct page *page)
{
return page_zonenum(page) == ZONE_DEVICE;
}
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
>> + * without scanning the entire segment. This helper returns true if either
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+ const struct page *b)
+{
+ if (is_zone_device_page(a) != is_zone_device_page(b))
+ return false;
+ if (!is_zone_device_page(a))
+ return true;
+ return a->pgmap == b->pgmap;
+}
+
extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
#else
@@ -993,6 +1012,11 @@ static inline bool is_zone_device_page(const struct page *page)
{
return false;
}
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+ const struct page *b)
+{
+ return true;
+}
#endif
static inline bool folio_is_zone_device(const struct folio *folio)
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
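[Editorial note] The three-way logic of the new helper can be modeled in standalone C. The struct below is a stand-in carrying only the two facts the check needs; the real kernel version derives zone-device-ness from page_zonenum() and reads page->pgmap:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; not the kernel's layout. */
struct toy_page {
	bool is_zone_device;
	const void *pgmap;	/* NULL for regular pages */
};

static bool zone_device_pages_have_same_pgmap(const struct toy_page *a,
					      const struct toy_page *b)
{
	if (a->is_zone_device != b->is_zone_device)
		return false;	/* never mix page types in one segment */
	if (!a->is_zone_device)
		return true;	/* two regular pages always pass */
	return a->pgmap == b->pgmap;	/* device pages need same pgmap */
}
```

With this predicate, a merge loop can decide mergeability by looking only at adjacent pages, which is the property the commit message is after: the pgmap of a whole segment is then recoverable from any one of its pages.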
* Re: [PATCH v11 4/9] block: add check when merging zone device pages
2022-10-21 17:41 ` [PATCH v11 4/9] block: add check when merging zone device pages Logan Gunthorpe
@ 2022-10-25 1:16 ` Chaitanya Kulkarni
0 siblings, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:16 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Consecutive zone device pages should not be merged into the same sgl
> or bvec segment with other types of pages or if they belong to different
> pgmaps. Otherwise getting the pgmap of a given segment is not possible
> without scanning the entire segment. This helper returns true either if
> both pages are not zone device pages or both pages are zone device
> pages with the same pgmap.
>
> Add a helper to determine if zone device pages are mergeable and use
> this helper in page_is_mergeable().
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> ---
> block/bio.c | 2 ++
> include/linux/mmzone.h | 24 ++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH v11 5/9] lib/scatterlist: add check when merging zone device pages
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (3 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 4/9] block: add check when merging zone device pages Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:19 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
` (5 subsequent siblings)
10 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The helper returns true if either
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.
Factor out the check for page mergeability into a pages_are_mergeable()
helper and add a check with zone_device_pages_have_same_pgmap().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
lib/scatterlist.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c8c3d675845c..a0ad2a7959b5 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct sg_append_table *table,
return new_sg;
}
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+ if (page_to_pfn(a) != page_to_pfn(b) + 1)
+ return false;
+ if (!zone_device_pages_have_same_pgmap(a, b))
+ return false;
+ return true;
+}
+
/**
* sg_alloc_append_table_from_pages - Allocate and initialize an append sg
* table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
unsigned int added_nents = 0;
struct scatterlist *s = sgt_append->prv;
+ struct page *last_pg;
/*
* The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
return -EOPNOTSUPP;
if (sgt_append->prv) {
- unsigned long paddr =
- (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
- sgt_append->prv->offset + sgt_append->prv->length) /
- PAGE_SIZE;
-
if (WARN_ON(offset))
return -EINVAL;
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
- while (n_pages && page_to_pfn(pages[0]) == paddr) {
+ last_pg = sg_page(sgt_append->prv);
+ while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
- paddr++;
+ last_pg = pages[0];
pages++;
n_pages--;
}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
- page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+ !pages_are_mergeable(pages[i], pages[i - 1])) {
chunks++;
seg_len = 0;
}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
for (j = cur_page + 1; j < n_pages; j++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
- page_to_pfn(pages[j]) !=
- page_to_pfn(pages[j - 1]) + 1)
+ !pages_are_mergeable(pages[j], pages[j - 1]))
break;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
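[Editorial note] The effect of the tightened merge test on segment counting can be sketched in runnable C. Pages are modeled as (pfn, pgmap) pairs, where a NULL pgmap stands for a regular page — a simplification of the kernel's check, not its actual data structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in: just the fields the merge decision consumes. */
struct toy_page {
	unsigned long pfn;
	const void *pgmap;	/* NULL models a regular page */
};

/* Page "next" can extend a segment ending in "prev" only if its pfn
 * is the immediate successor and both share the same pgmap state. */
static bool pages_are_mergeable(const struct toy_page *next,
				const struct toy_page *prev)
{
	if (next->pfn != prev->pfn + 1)
		return false;
	return next->pgmap == prev->pgmap;
}

/* Count how many sgl segments a page run needs, mirroring the chunk
 * counting loop in sg_alloc_append_table_from_pages() (without the
 * max_segment limit, which the real code also applies). */
static size_t count_segments(const struct toy_page *pages, size_t n)
{
	size_t i, chunks = n ? 1 : 0;

	for (i = 1; i < n; i++)
		if (!pages_are_mergeable(&pages[i], &pages[i - 1]))
			chunks++;
	return chunks;
}
```

A pfn-contiguous run that crosses from regular pages into a pgmap (or between two pgmaps) now splits into separate segments, whereas the old pfn-only check would have merged it.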
* Re: [PATCH v11 5/9] lib/scatterlist: add check when merging zone device pages
2022-10-21 17:41 ` [PATCH v11 5/9] lib/scatterlist: " Logan Gunthorpe
@ 2022-10-25 1:19 ` Chaitanya Kulkarni
0 siblings, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:19 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Consecutive zone device pages should not be merged into the same sgl
> or bvec segment with other types of pages or if they belong to different
> pgmaps. Otherwise getting the pgmap of a given segment is not possible
> without scanning the entire segment. This helper returns true either if
> both pages are not zone device pages or both pages are zone device
> pages with the same pgmap.
>
> Factor out the check for page mergability into a pages_are_mergable()
> helper and add a check with zone_device_pages_are_mergeable().
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> lib/scatterlist.c | 25 +++++++++++++++----------
> 1 file changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
> index c8c3d675845c..a0ad2a7959b5 100644
> --- a/lib/scatterlist.c
> +++ b/lib/scatterlist.c
> @@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct sg_append_table *table,
> return new_sg;
> }
>
> +static bool pages_are_mergeable(struct page *a, struct page *b)
> +{
> + if (page_to_pfn(a) != page_to_pfn(b) + 1)
> + return false;
> + if (!zone_device_pages_have_same_pgmap(a, b))
> + return false;
> + return true;
> +}
> +
not sure if it makes sense to make it inline? either way,
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (4 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 5/9] lib/scatterlist: " Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:23 ` Chaitanya Kulkarni
2022-10-25 1:25 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
` (4 subsequent siblings)
10 siblings, 2 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
block/bio.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 439469370b7c..a7abf9b1b66a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1197,6 +1197,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
+ unsigned int gup_flags = 0;
ssize_t size, left;
unsigned len, i = 0;
size_t offset, trim;
@@ -1210,6 +1211,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+ if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+ gup_flags |= FOLL_PCI_P2PDMA;
+
/*
* Each segment in the iov is required to be a block size multiple.
* However, we may not be able to get the entire segment if it spans
@@ -1217,8 +1221,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset);
+ size = iov_iter_get_pages(iter, pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, &offset, gup_flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
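[Editorial note] The gating logic this patch adds — request P2PDMA pages from GUP only when the bio's queue advertises support — reduces to a small flag computation, modeled here in standalone C (the flag value and struct are illustrative, not the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit; the real FOLL_PCI_P2PDMA value differs. */
#define TOY_FOLL_PCI_P2PDMA (1u << 0)

/* Toy stand-in for the request queue's capability bit. */
struct toy_queue {
	bool pci_p2pdma_capable;
};

/* Mirrors the gup_flags setup in __bio_iov_iter_get_pages(): start
 * from 0 and opt in only when the queue supports P2PDMA. */
static unsigned int bio_gup_flags(const struct toy_queue *q)
{
	unsigned int gup_flags = 0;

	if (q && q->pci_p2pdma_capable)
		gup_flags |= TOY_FOLL_PCI_P2PDMA;
	return gup_flags;
}
```

Defaulting to 0 keeps the existing behavior for every queue that does not opt in, so P2PDMA pages can only ever be pinned on paths that have declared they can handle them.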
* Re: [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
2022-10-21 17:41 ` [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
@ 2022-10-25 1:23 ` Chaitanya Kulkarni
2022-10-25 15:37 ` Logan Gunthorpe
2022-10-25 1:25 ` Chaitanya Kulkarni
1 sibling, 1 reply; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:23 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
> /*
> * Each segment in the iov is required to be a block size multiple.
> * However, we may not be able to get the entire segment if it spans
> @@ -1217,8 +1221,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> - size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
> - nr_pages, &offset);
> + size = iov_iter_get_pages(iter, pages,
> + UINT_MAX - bio->bi_iter.bi_size,
> + nr_pages, &offset, gup_flags);
nit, 3rd param in above call fits on the first line ? plz check :-
iov_iter_get_pages(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
nr_pages, &offset, gup_flags);
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
2022-10-25 1:23 ` Chaitanya Kulkarni
@ 2022-10-25 15:37 ` Logan Gunthorpe
0 siblings, 0 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-25 15:37 UTC (permalink / raw)
To: Chaitanya Kulkarni, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 2022-10-24 19:23, Chaitanya Kulkarni wrote:
> /*
>> * Each segment in the iov is required to be a block size multiple.
>> * However, we may not be able to get the entire segment if it spans
>> @@ -1217,8 +1221,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>> * result to ensure the bio's total size is correct. The remainder of
>> * the iov data will be picked up in the next bio iteration.
>> */
>> - size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
>> - nr_pages, &offset);
>> + size = iov_iter_get_pages(iter, pages,
>> + UINT_MAX - bio->bi_iter.bi_size,
>> + nr_pages, &offset, gup_flags);
>
> nit, 3rd param in above call fits on the first line ? plz check :-
>
> iov_iter_get_pages(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
> nr_pages, &offset, gup_flags);
Oh, yup, this just fits. I'll queue up the fix in case I send a v12.
Logan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
2022-10-21 17:41 ` [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
2022-10-25 1:23 ` Chaitanya Kulkarni
@ 2022-10-25 1:25 ` Chaitanya Kulkarni
1 sibling, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:25 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
> * Each segment in the iov is required to be a block size multiple.
> * However, we may not be able to get the entire segment if it spans
> @@ -1217,8 +1221,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> - size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
> - nr_pages, &offset);
> + size = iov_iter_get_pages(iter, pages,
> + UINT_MAX - bio->bi_iter.bi_size,
> + nr_pages, &offset, gup_flags);
nit:-
3rd parameter in the above call fits on the 1st line? plz check
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (5 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages() Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:26 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
` (3 subsequent siblings)
10 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables the NVMe passthru requests to
use P2PDMA pages.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
block/blk-map.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 34735626b00f..8750f82d7da4 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -267,6 +267,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
{
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+ unsigned int gup_flags = 0;
struct bio *bio;
int ret;
int j;
@@ -278,6 +279,9 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (bio == NULL)
return -ENOMEM;
+ if (blk_queue_pci_p2pdma(rq->q))
+ gup_flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
ssize_t bytes;
@@ -286,11 +290,11 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
pages = stack_pages;
- bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
- nr_vecs, &offs);
+ bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
+ nr_vecs, &offs, gup_flags);
} else {
- bytes = iov_iter_get_pages_alloc2(iter, &pages,
- LONG_MAX, &offs);
+ bytes = iov_iter_get_pages_alloc(iter, &pages,
+ LONG_MAX, &offs, gup_flags);
}
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
--
2.30.2
* Re: [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
2022-10-21 17:41 ` [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
@ 2022-10-25 1:26 ` Chaitanya Kulkarni
0 siblings, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:26 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
> iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
> passed from userspace and enables the NVMe passthru requests to
> use P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> ---
> block/blk-map.c | 12 ++++++++----
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (6 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov() Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:29 ` Chaitanya Kulkarni
2022-10-25 1:34 ` Chaitanya Kulkarni
2022-10-21 17:41 ` [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
` (2 subsequent siblings)
10 siblings, 2 replies; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe, Bjorn Helgaas
Create a sysfs bin attribute called "allocate" under the existing
"p2pmem" group. The only allowable operation on this file is the mmap()
call.
When mmap() is called on this attribute, the kernel allocates a chunk of
memory from the genalloc and inserts the pages into the VMA. The
dev_pagemap .page_free callback will indicate when these pages are no
longer used and they will be put back into the genalloc.
On device unbind, remove the sysfs file before the memremap_pages are
cleaned up. This ensures unmap_mapping_range() is called on the file's
inode and no new mappings can be created.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 124 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 88dc66ee1c46..27539770a613 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(published);
+static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+ struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+ size_t len = vma->vm_end - vma->vm_start;
+ struct pci_p2pdma *p2pdma;
+ struct percpu_ref *ref;
+ unsigned long vaddr;
+ void *kaddr;
+ int ret;
+
+ /* prevent private mappings from being established */
+ if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted private mapping\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ if (vma->vm_pgoff) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted mapping with non-zero offset\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ rcu_read_lock();
+ p2pdma = rcu_dereference(pdev->p2pdma);
+ if (!p2pdma) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
+ if (!kaddr) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * vm_insert_page() can sleep, so a reference is taken to mapping
+ * such that rcu_read_unlock() can be done before inserting the
+ * pages
+ */
+ if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
+ ret = -ENODEV;
+ goto out_free_mem;
+ }
+ rcu_read_unlock();
+
+ for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+ ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+ if (ret) {
+ gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+ return ret;
+ }
+ percpu_ref_get(ref);
+ put_page(virt_to_page(kaddr));
+ kaddr += PAGE_SIZE;
+ len -= PAGE_SIZE;
+ }
+
+ percpu_ref_put(ref);
+
+ return 0;
+out_free_mem:
+ gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
+static struct bin_attribute p2pmem_alloc_attr = {
+ .attr = { .name = "allocate", .mode = 0660 },
+ .mmap = p2pmem_alloc_mmap,
+ /*
+ * Some places where we want to call mmap (ie. python) will check
+ * that the file size is greater than the mmap size before allowing
+ * the mmap to continue. To work around this, just set the size
+ * to be very large.
+ */
+ .size = SZ_1T,
+};
+
static struct attribute *p2pmem_attrs[] = {
&dev_attr_size.attr,
&dev_attr_available.attr,
@@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
NULL,
};
+static struct bin_attribute *p2pmem_bin_attrs[] = {
+ &p2pmem_alloc_attr,
+ NULL,
+};
+
static const struct attribute_group p2pmem_group = {
.attrs = p2pmem_attrs,
+ .bin_attrs = p2pmem_bin_attrs,
.name = "p2pmem",
};
+static void p2pdma_page_free(struct page *page)
+{
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct percpu_ref *ref;
+
+ gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+ (uintptr_t)page_to_virt(page), PAGE_SIZE,
+ (void **)&ref);
+ percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+ .page_free = p2pdma_page_free,
+};
+
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
@@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
}
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ /*
+ * Removing the alloc attribute from sysfs will call
+ * unmap_mapping_range() on the inode, teardown any existing userspace
+ * mappings and prevent new ones from being created.
+ */
+ sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+ p2pmem_group.name);
+}
+
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
@@ -198,6 +316,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->range.end = pgmap->range.start + size - 1;
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->ops = &p2pdma_pgmap_ops;
p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
@@ -209,6 +328,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
goto pgmap_free;
}
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+ pdev);
+ if (error)
+ goto pages_free;
+
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
--
2.30.2
* Re: [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
2022-10-21 17:41 ` [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
@ 2022-10-25 1:29 ` Chaitanya Kulkarni
2022-10-25 1:34 ` Chaitanya Kulkarni
1 sibling, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:29 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Bjorn Helgaas
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Create a sysfs bin attribute called "allocate" under the existing
> "p2pmem" group. The only allowable operation on this file is the mmap()
> call.
>
> When mmap() is called on this attribute, the kernel allocates a chunk of
> memory from the genalloc and inserts the pages into the VMA. The
> dev_pagemap .page_free callback will indicate when these pages are no
> longer used and they will be put back into the genalloc.
>
> On device unbind, remove the sysfs file before the memremap_pages are
> cleaned up. This ensures unmap_mapping_range() is called on the file's
> inode and no new mappings can be created.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
2022-10-21 17:41 ` [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
2022-10-25 1:29 ` Chaitanya Kulkarni
@ 2022-10-25 1:34 ` Chaitanya Kulkarni
1 sibling, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:34 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Bjorn Helgaas
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Create a sysfs bin attribute called "allocate" under the existing
> "p2pmem" group. The only allowable operation on this file is the mmap()
> call.
>
> When mmap() is called on this attribute, the kernel allocates a chunk of
> memory from the genalloc and inserts the pages into the VMA. The
> dev_pagemap .page_free callback will indicate when these pages are no
> longer used and they will be put back into the genalloc.
>
> On device unbind, remove the sysfs file before the memremap_pages are
> cleaned up. This ensures unmap_mapping_range() is called on the file's
> inode and no new mappings can be created.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
> drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 124 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 88dc66ee1c46..27539770a613 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RO(published);
>
> +static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *attr, struct vm_area_struct *vma)
> +{
> + struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
> + size_t len = vma->vm_end - vma->vm_start;
> + struct pci_p2pdma *p2pdma;
> + struct percpu_ref *ref;
> + unsigned long vaddr;
> + void *kaddr;
> + int ret;
> +
> + /* prevent private mappings from being established */
> + if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> + pci_info_ratelimited(pdev,
> + "%s: fail, attempted private mapping\n",
> + current->comm);
> + return -EINVAL;
> + }
> +
> + if (vma->vm_pgoff) {
> + pci_info_ratelimited(pdev,
> + "%s: fail, attempted mapping with non-zero offset\n",
> + current->comm);
> + return -EINVAL;
> + }
> +
> + rcu_read_lock();
> + p2pdma = rcu_dereference(pdev->p2pdma);
> + if (!p2pdma) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
> + if (!kaddr) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + /*
> + * vm_insert_page() can sleep, so a reference is taken to mapping
> + * such that rcu_read_unlock() can be done before inserting the
> + * pages
> + */
> + if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
> + ret = -ENODEV;
> + goto out_free_mem;
> + }
> + rcu_read_unlock();
> +
> + for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> + ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
> + if (ret) {
> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> + return ret;
> + }
> + percpu_ref_get(ref);
> + put_page(virt_to_page(kaddr));
> + kaddr += PAGE_SIZE;
> + len -= PAGE_SIZE;
> + }
> +
> + percpu_ref_put(ref);
> +
> + return 0;
> +out_free_mem:
> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +static struct bin_attribute p2pmem_alloc_attr = {
> + .attr = { .name = "allocate", .mode = 0660 },
> + .mmap = p2pmem_alloc_mmap,
> + /*
> + * Some places where we want to call mmap (ie. python) will check
> + * that the file size is greater than the mmap size before allowing
> + * the mmap to continue. To work around this, just set the size
> + * to be very large.
> + */
> + .size = SZ_1T,
> +};
> +
> static struct attribute *p2pmem_attrs[] = {
> &dev_attr_size.attr,
> &dev_attr_available.attr,
> @@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
> NULL,
> };
>
> +static struct bin_attribute *p2pmem_bin_attrs[] = {
> + &p2pmem_alloc_attr,
> + NULL,
> +};
> +
> static const struct attribute_group p2pmem_group = {
> .attrs = p2pmem_attrs,
> + .bin_attrs = p2pmem_bin_attrs,
> .name = "p2pmem",
> };
>
> +static void p2pdma_page_free(struct page *page)
> +{
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> + struct percpu_ref *ref;
> +
> + gen_pool_free_owner(pgmap->provider->p2pdma->pool,
> + (uintptr_t)page_to_virt(page), PAGE_SIZE,
> + (void **)&ref);
> + percpu_ref_put(ref);
> +}
> +
> +static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> + .page_free = p2pdma_page_free,
> +};
> +
> static void pci_p2pdma_release(void *data)
> {
> struct pci_dev *pdev = data;
> @@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
> return error;
> }
>
> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> + struct pci_dev *pdev = data;
> +
> + /*
> + * Removing the alloc attribute from sysfs will call
> + * unmap_mapping_range() on the inode, teardown any existing userspace
> + * mappings and prevent new ones from being created.
> + */
> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
> + p2pmem_group.name);
> +}
> +
> /**
> * pci_p2pdma_add_resource - add memory for use as p2p memory
> * @pdev: the device to add the memory to
> @@ -198,6 +316,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> pgmap->range.end = pgmap->range.start + size - 1;
> pgmap->nr_range = 1;
> pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
> + pgmap->ops = &p2pdma_pgmap_ops;
>
> p2p_pgmap->provider = pdev;
> p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
> @@ -209,6 +328,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> goto pgmap_free;
> }
>
> + error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
> + pdev);
> + if (error)
> + goto pages_free;
> +
> p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
> pci_bus_address(pdev, bar) + offset,
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (7 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs Logan Gunthorpe
@ 2022-10-21 17:41 ` Logan Gunthorpe
2022-10-25 1:29 ` Chaitanya Kulkarni
2022-10-24 15:03 ` [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
2022-11-09 18:44 ` Jens Axboe
10 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-10-21 17:41 UTC (permalink / raw)
To: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, Logan Gunthorpe
Add documentation for the p2pmem/allocate binary file which allows
for allocating p2pmem buffers in userspace for passing to drivers
that support them. (Currently only O_DIRECT to NVMe devices.)
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
Documentation/ABI/testing/sysfs-bus-pci | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 840727fc75dc..ecf47559f495 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -407,6 +407,16 @@ Description:
file contains a '1' if the memory has been published for
use outside the driver that owns the device.
+What: /sys/bus/pci/devices/.../p2pmem/allocate
+Date: August 2022
+Contact: Logan Gunthorpe <logang@deltatee.com>
+Description:
+ This file allows mapping p2pmem into userspace. For each
+ mmap() call on this file, the kernel will allocate a chunk
+ of Peer-to-Peer memory for use in Peer-to-Peer transactions.
+ This memory can be used in O_DIRECT calls to NVMe backed
+ files for Peer-to-Peer copies.
+
What: /sys/bus/pci/devices/.../link/clkpm
/sys/bus/pci/devices/.../link/l0s_aspm
/sys/bus/pci/devices/.../link/l1_aspm
--
2.30.2
* Re: [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate
2022-10-21 17:41 ` [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
@ 2022-10-25 1:29 ` Chaitanya Kulkarni
0 siblings, 0 replies; 35+ messages in thread
From: Chaitanya Kulkarni @ 2022-10-25 1:29 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme, linux-block,
linux-pci, linux-mm
Cc: Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates
On 10/21/22 10:41, Logan Gunthorpe wrote:
> Add documentation for the p2pmem/allocate binary file which allows
> for allocating p2pmem buffers in userspace for passing to drivers
> that support them. (Currently only O_DIRECT to NVMe devices.)
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-ck
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (8 preceding siblings ...)
2022-10-21 17:41 ` [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate Logan Gunthorpe
@ 2022-10-24 15:03 ` Christoph Hellwig
2022-10-24 19:15 ` John Hubbard
2022-11-09 18:44 ` Jens Axboe
10 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-10-24 15:03 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Christoph Hellwig, Greg Kroah-Hartman, Dan Williams,
Jason Gunthorpe, Christian König, John Hubbard, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, viro
The series looks good to me now. How do we want to handle it? I think
we need a special branch somewhere (maybe in the block or mm trees?)
so that we can base the other iov_iter work from John on it. Also
Al has a whole bunch of iov_iter changes that we probably want on
the same branch as well, although some of those (READ vs WRITE fixups)
look like 6.1 material to me.
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-10-24 15:03 ` [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
@ 2022-10-24 19:15 ` John Hubbard
2022-11-08 6:56 ` Christoph Hellwig
0 siblings, 1 reply; 35+ messages in thread
From: John Hubbard @ 2022-10-24 19:15 UTC (permalink / raw)
To: Christoph Hellwig, Logan Gunthorpe
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Greg Kroah-Hartman, Dan Williams, Jason Gunthorpe,
Christian König, Don Dutile, Matthew Wilcox, Daniel Vetter,
Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
Chaitanya Kulkarni, Ralph Campbell, Stephen Bates, viro
On 10/24/22 08:03, Christoph Hellwig wrote:
> The series looks good to me now. How do we want to handle it? I think
> we need a special branch somewhere (maybe in the block or mm trees?)
> so that we can base the other iov_iter work from John on it. Also
> Al has a whole bunch of iov_iter changes that we probably want on
> the same branch as well, although some of those (READ vs WRITE fixups)
> look like 6.1 material to me.
>
A little earlier, Jens graciously offered [1] to provide a topic branch,
such as:
for-6.2/block-gup [2]
(I've moved the name forward from 6.1 to 6.2, because that discussion
was 7 weeks ago.)
[1] https://lore.kernel.org/ae675a01-90e6-4af1-6c43-660b3a6c7b72@kernel.dk
[2] https://lore.kernel.org/55a2d67f-9a12-9fe6-d73b-8c3f5eb36f31@kernel.dk
thanks,
--
John Hubbard
NVIDIA
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-10-24 19:15 ` John Hubbard
@ 2022-11-08 6:56 ` Christoph Hellwig
2022-11-09 17:28 ` Logan Gunthorpe
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-11-08 6:56 UTC (permalink / raw)
To: John Hubbard
Cc: Christoph Hellwig, Logan Gunthorpe, linux-kernel, linux-nvme,
linux-block, linux-pci, linux-mm, Greg Kroah-Hartman,
Dan Williams, Jason Gunthorpe, Christian König, Don Dutile,
Matthew Wilcox, Daniel Vetter, Minturn Dave B, Jason Ekstrand,
Dave Hansen, Xiong Jianxin, Bjorn Helgaas, Ira Weiny,
Robin Murphy, Martin Oliveira, Chaitanya Kulkarni,
Ralph Campbell, Stephen Bates, viro
On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
> A little earlier, Jens graciously offered [1] to provide a topic branch,
> such as:
>
> for-6.2/block-gup [2]
>
> (I've moved the name forward from 6.1 to 6.2, because that discussion
> was 7 weeks ago.)
So what are we going to do with this series? It would be sad to miss
the merge window again.
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-11-08 6:56 ` Christoph Hellwig
@ 2022-11-09 17:28 ` Logan Gunthorpe
2022-11-09 18:33 ` Jens Axboe
0 siblings, 1 reply; 35+ messages in thread
From: Logan Gunthorpe @ 2022-11-09 17:28 UTC (permalink / raw)
To: Christoph Hellwig, John Hubbard, Jens Axboe
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Greg Kroah-Hartman, Dan Williams, Jason Gunthorpe,
Christian König, Don Dutile, Matthew Wilcox, Daniel Vetter,
Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
Chaitanya Kulkarni, Ralph Campbell, Stephen Bates, viro
@add Jens
On 2022-11-07 23:56, Christoph Hellwig wrote:
> On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
>> A little earlier, Jens graciously offered [1] to provide a topic branch,
>> such as:
>>
>> for-6.2/block-gup [2]
>>
>> (I've moved the name forward from 6.1 to 6.2, because that discussion
>> was 7 weeks ago.)
>
> So what are we going to do with this series? It would be sad to miss
> the merge window again.
I noticed Jens wasn't copied on this series. I've added him. It would be
nice to get this in someone's tree soon.
Thanks!
Logan
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-11-09 17:28 ` Logan Gunthorpe
@ 2022-11-09 18:33 ` Jens Axboe
0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2022-11-09 18:33 UTC (permalink / raw)
To: Logan Gunthorpe, Christoph Hellwig, John Hubbard
Cc: linux-kernel, linux-nvme, linux-block, linux-pci, linux-mm,
Greg Kroah-Hartman, Dan Williams, Jason Gunthorpe,
Christian König, Don Dutile, Matthew Wilcox, Daniel Vetter,
Minturn Dave B, Jason Ekstrand, Dave Hansen, Xiong Jianxin,
Bjorn Helgaas, Ira Weiny, Robin Murphy, Martin Oliveira,
Chaitanya Kulkarni, Ralph Campbell, Stephen Bates, viro
On 11/9/22 10:28 AM, Logan Gunthorpe wrote:
> @add Jens
>
> On 2022-11-07 23:56, Christoph Hellwig wrote:
>> On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
>>> A little earlier, Jens graciously offered [1] to provide a topic branch,
>>> such as:
>>>
>>> for-6.2/block-gup [2]
>>>
>>> (I've moved the name forward from 6.1 to 6.2, because that discussion
>>> was 7 weeks ago.)
>>
>> So what are we going to do with this series? It would be sad to miss
>> the merge window again.
>
> I noticed Jens wasn't copied on this series. I've added him. It would be
> nice to get this in someone's tree soon.
I took a look and the series looks fine to me.
--
Jens Axboe
* Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices
2022-10-21 17:41 [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Logan Gunthorpe
` (9 preceding siblings ...)
2022-10-24 15:03 ` [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices Christoph Hellwig
@ 2022-11-09 18:44 ` Jens Axboe
10 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2022-11-09 18:44 UTC (permalink / raw)
To: linux-pci, Logan Gunthorpe, linux-nvme, linux-mm, linux-block,
linux-kernel
Cc: Minturn Dave B, Jason Gunthorpe, Stephen Bates, Xiong Jianxin,
Matthew Wilcox, Martin Oliveira, Chaitanya Kulkarni,
Bjorn Helgaas, Dave Hansen, Ralph Campbell, Dan Williams,
Christoph Hellwig, Jason Ekstrand, Robin Murphy, Ira Weiny,
John Hubbard, Christian König, Daniel Vetter,
Greg Kroah-Hartman, Don Dutile
On Fri, 21 Oct 2022 11:41:07 -0600, Logan Gunthorpe wrote:
> This is the latest P2PDMA userspace patch set. This version includes
> some cleanup from feedback from the last posting[1].
>
> This patch set enables userspace P2PDMA by allowing userspace to mmap()
> allocated chunks of the CMB. The resulting VMA can be passed only
> to O_DIRECT IO on NVMe backed files or block devices. A flag is added
> to GUP() in Patch 1, then Patches 2 through 6 wire this flag up based
> on whether the block queue indicates P2PDMA support. Patches 7
> creates the sysfs resource that can hand out the VMAs and Patch 8
> adds brief documentation for the new interface.
>
> [...]
Applied, thanks!
[1/9] mm: allow multiple error returns in try_grab_page()
commit: 0f0892356fa174bdd8bd655c820ee3658c4c9f01
[2/9] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
commit: 4003f107fa2eabb0aab90e37a1ed7b74c6f0d132
[3/9] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
commit: d82076403cef7fcd1e7617c9db48bf21ebdc1f9c
[4/9] block: add check when merging zone device pages
commit: 49580e690755d0e51ed7aa2c33225dd884fa738a
[5/9] lib/scatterlist: add check when merging zone device pages
commit: 1567b49d1a4081ba7e1a0ff2dc39bc58c59f2a51
[6/9] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
commit: 5e3e3f2e15df46abcab1959f93f214f778b6ec49
[7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
commit: 7ee4ccf57484d260c37b29f9a48b65c4101403e8
[8/9] PCI/P2PDMA: Allow userspace VMA allocations through sysfs
commit: 7e9c7ef83d785236f5a8c3761dd053fae9b92fb8
[9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate
commit: 6d4338cb4070a762dba0cadee00b7ec206d9f868
Best regards,
--
Jens Axboe