All of lore.kernel.org
* [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
@ 2019-06-05  9:10 Pingfan Liu
  2019-06-05  9:10 ` [PATCHv3 2/2] mm/gup: rename nr as nr_pinned " Pingfan Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-05  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Pingfan Liu, Ira Weiny, Andrew Morton, Mike Rapoport,
	Dan Williams, Matthew Wilcox, John Hubbard, Aneesh Kumar K.V,
	Keith Busch, Christoph Hellwig, linux-kernel

As for FOLL_LONGTERM, it is checked in the slow path
__gup_longterm_unlocked(), but it is not checked in the fast path, which
means CMA pages can leak into long-term pins through this gap.

Place a check in the fast path.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index f173fcb..0e59af9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
+#ifdef CONFIG_CMA
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pinned; i++)
+		if (is_migrate_cma_page(pages[i])) {
+			put_user_pages(pages + i, nr_pinned - i);
+			return i;
+		}
+
+	return nr_pinned;
+}
+#else
+static inline int reject_cma_pages(int nr_pinned, struct page **pages)
+{
+	return nr_pinned;
+}
+#endif
+
 /**
  * get_user_pages_fast() - pin user pages in memory
  * @start:	starting user address
@@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 		ret = nr;
 	}
 
+	if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
+		nr = reject_cma_pages(nr, pages);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
-- 
2.7.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCHv3 2/2] mm/gup: rename nr as nr_pinned in get_user_pages_fast()
  2019-06-05  9:10 [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast() Pingfan Liu
@ 2019-06-05  9:10 ` Pingfan Liu
  2019-06-05 21:49 ` [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM " Andrew Morton
  2019-06-11 16:15 ` Aneesh Kumar K.V
  2 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-05  9:10 UTC (permalink / raw)
  To: linux-mm
  Cc: Pingfan Liu, Ira Weiny, Andrew Morton, Mike Rapoport,
	Dan Williams, Matthew Wilcox, John Hubbard, Aneesh Kumar K.V,
	Keith Busch, Christoph Hellwig, linux-kernel

To better reflect the pinned state of the pages and make the code
self-explanatory, rename nr to nr_pinned.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: linux-kernel@vger.kernel.org
---
 mm/gup.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0e59af9..9b3c8a6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2236,7 +2236,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr_pinned = 0, ret = 0;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -2251,28 +2251,28 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	if (gup_fast_permitted(start, nr_pages)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, gup_flags, pages, &nr);
+		gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
 		local_irq_enable();
-		ret = nr;
+		ret = nr_pinned;
 	}
 
-	if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
-		nr = reject_cma_pages(nr, pages);
+	if (unlikely(gup_flags & FOLL_LONGTERM) && nr_pinned)
+		nr_pinned = reject_cma_pages(nr_pinned, pages);
 
-	if (nr < nr_pages) {
+	if (nr_pinned < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
-		start += nr << PAGE_SHIFT;
-		pages += nr;
+		start += nr_pinned << PAGE_SHIFT;
+		pages += nr_pinned;
 
-		ret = __gup_longterm_unlocked(start, nr_pages - nr,
+		ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
 					      gup_flags, pages);
 
 		/* Have to be a bit careful with return values */
-		if (nr > 0) {
+		if (nr_pinned > 0) {
 			if (ret < 0)
-				ret = nr;
+				ret = nr_pinned;
 			else
-				ret += nr;
+				ret += nr_pinned;
 		}
 	}
 
-- 
2.7.5



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-05  9:10 [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast() Pingfan Liu
  2019-06-05  9:10 ` [PATCHv3 2/2] mm/gup: rename nr as nr_pinned " Pingfan Liu
@ 2019-06-05 21:49 ` Andrew Morton
  2019-06-06  2:19     ` Pingfan Liu
  2019-06-11 16:15 ` Aneesh Kumar K.V
  2 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2019-06-05 21:49 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-mm, Ira Weiny, Mike Rapoport, Dan Williams, Matthew Wilcox,
	John Hubbard, Aneesh Kumar K.V, Keith Busch, Christoph Hellwig,
	linux-kernel

On Wed,  5 Jun 2019 17:10:19 +0800 Pingfan Liu <kernelfans@gmail.com> wrote:

> As for FOLL_LONGTERM, it is checked in the slow path
> __gup_longterm_unlocked(), but it is not checked in the fast path, which
> means CMA pages can leak into long-term pins through this gap.
> 
> Place a check in the fast path.

I'm not actually seeing (in either the existing code, this changelog, or
the patch) an explanation of *why* we wish to exclude CMA pages from
longterm pinning.

> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> +#ifdef CONFIG_CMA
> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_pinned; i++)
> +		if (is_migrate_cma_page(pages[i])) {
> +			put_user_pages(pages + i, nr_pinned - i);
> +			return i;
> +		}
> +
> +	return nr_pinned;
> +}

There's no point in inlining this.

The code seems inefficient.  If it encounters a single CMA page it can
end up discarding a possibly significant number of non-CMA pages.  I
guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
rare.  But could we avoid this (and the second pass across pages[]) by
checking for a CMA page within gup_pte_range()?

> +#else
> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> +{
> +	return nr_pinned;
> +}
> +#endif
> +
>  /**
>   * get_user_pages_fast() - pin user pages in memory
>   * @start:	starting user address
> @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  		ret = nr;
>  	}
>  
> +	if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
> +		nr = reject_cma_pages(nr, pages);
> +

This would be a suitable place to add a comment explaining why we're
doing this...



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-05 21:49 ` [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM " Andrew Morton
@ 2019-06-06  2:19     ` Pingfan Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-06  2:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Ira Weiny, Mike Rapoport, Dan Williams, Matthew Wilcox,
	John Hubbard, Aneesh Kumar K.V, Keith Busch, Christoph Hellwig,
	LKML

On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  5 Jun 2019 17:10:19 +0800 Pingfan Liu <kernelfans@gmail.com> wrote:
>
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(), but it is not checked in the fast path, which
> > means CMA pages can leak into long-term pins through this gap.
> >
> > Place a check in the fast path.
>
> I'm not actually seeing (in either the existing code, this changelog, or
> the patch) an explanation of *why* we wish to exclude CMA pages from
> longterm pinning.
>
What about a short description like this:
FOLL_LONGTERM suggests a pin which is going to be given to hardware
and cannot move. It would shrink the CMA area permanently and should
be excluded.

> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >       return ret;
> >  }
> >
> > +#ifdef CONFIG_CMA
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < nr_pinned; i++)
> > +             if (is_migrate_cma_page(pages[i])) {
> > +                     put_user_pages(pages + i, nr_pinned - i);
> > +                     return i;
> > +             }
> > +
> > +     return nr_pinned;
> > +}
>
> There's no point in inlining this.
OK, will drop it in V4.

>
> The code seems inefficient.  If it encounters a single CMA page it can
> end up discarding a possibly significant number of non-CMA pages.  I
The trick is that the pages are not discarded; in fact, they are still
referenced by the PTEs. We just leave the slow path to pick up the
non-CMA pages again.

> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> rare.  But could we avoid this (and the second pass across pages[]) by
> checking for a CMA page within gup_pte_range()?
It would spread the same logic to the hugetlb pte and normal pte paths,
with no performance improvement since we fall back to the slow path
anyway. So I think it is not worth it.

>
> > +#else
> > +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > +{
> > +     return nr_pinned;
> > +}
> > +#endif
> > +
> >  /**
> >   * get_user_pages_fast() - pin user pages in memory
> >   * @start:   starting user address
> > @@ -2236,6 +2256,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >               ret = nr;
> >       }
> >
> > +     if (unlikely(gup_flags & FOLL_LONGTERM) && nr)
> > +             nr = reject_cma_pages(nr, pages);
> > +
>
> This would be a suitable place to add a comment explaining why we're
> doing this...
I would add a comment: "FOLL_LONGTERM suggests a pin given to hardware
and rarely returned."

Thanks for your kind review.

Regards,
  Pingfan



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-06  2:19     ` Pingfan Liu
  (?)
@ 2019-06-06 21:17     ` John Hubbard
  2019-06-07  6:10         ` Pingfan Liu
  -1 siblings, 1 reply; 20+ messages in thread
From: John Hubbard @ 2019-06-06 21:17 UTC (permalink / raw)
  To: Pingfan Liu, Andrew Morton
  Cc: linux-mm, Ira Weiny, Mike Rapoport, Dan Williams, Matthew Wilcox,
	Aneesh Kumar K.V, Keith Busch, Christoph Hellwig, LKML

On 6/5/19 7:19 PM, Pingfan Liu wrote:
> On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
...
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>>>       return ret;
>>>  }
>>>
>>> +#ifdef CONFIG_CMA
>>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < nr_pinned; i++)
>>> +             if (is_migrate_cma_page(pages[i])) {
>>> +                     put_user_pages(pages + i, nr_pinned - i);
>>> +                     return i;
>>> +             }
>>> +
>>> +     return nr_pinned;
>>> +}
>>
>> There's no point in inlining this.
> OK, will drop it in V4.
> 
>>
>> The code seems inefficient.  If it encounters a single CMA page it can
>> end up discarding a possibly significant number of non-CMA pages.  I
> The trick is that the pages are not discarded; in fact, they are still
> referenced by the PTEs. We just leave the slow path to pick up the
> non-CMA pages again.
> 
>> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
>> rare.  But could we avoid this (and the second pass across pages[]) by
>> checking for a CMA page within gup_pte_range()?
> It would spread the same logic to the hugetlb pte and normal pte paths,
> with no performance improvement since we fall back to the slow path
> anyway. So I think it is not worth it.
> 
>>

I think the concern is: for the successful gup_fast case with no CMA
pages, this patch adds another complete loop through all the pages, in
what is supposed to be the fast case.

If the check were instead done as part of the gup_pte_range(), then
it would be a little more efficient for that case.

As for whether it's worth it, *probably* this is too small an effect to measure.
But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
with O_DIRECT on an NVMe drive might shed some light. Here's an fio.conf file
that Jan Kara and Tom Talpey helped me come up with for related testing:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-06 21:17     ` John Hubbard
@ 2019-06-07  6:10         ` Pingfan Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-07  6:10 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, linux-mm, Ira Weiny, Mike Rapoport, Dan Williams,
	Matthew Wilcox, Aneesh Kumar K.V, Keith Busch, Christoph Hellwig,
	LKML

On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> ...
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >>>       return ret;
> >>>  }
> >>>
> >>> +#ifdef CONFIG_CMA
> >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> >>> +{
> >>> +     int i;
> >>> +
> >>> +     for (i = 0; i < nr_pinned; i++)
> >>> +             if (is_migrate_cma_page(pages[i])) {
> >>> +                     put_user_pages(pages + i, nr_pinned - i);
> >>> +                     return i;
> >>> +             }
> >>> +
> >>> +     return nr_pinned;
> >>> +}
> >>
> >> There's no point in inlining this.
> > OK, will drop it in V4.
> >
> >>
> >> The code seems inefficient.  If it encounters a single CMA page it can
> >> end up discarding a possibly significant number of non-CMA pages.  I
> > The trick is the page is not be discarded, in fact, they are still be
> > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > pages again.
> >
> >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> >> rare.  But could we avoid this (and the second pass across pages[]) by
> >> checking for a CMA page within gup_pte_range()?
> > It will spread the same logic to hugetlb pte and normal pte. And no
> > improvement in performance due to slow path. So I think maybe it is
> > not worth.
> >
> >>
>
> I think the concern is: for the successful gup_fast case with no CMA
> pages, this patch is adding another complete loop through all the
> pages. In the fast case.
>
> If the check were instead done as part of the gup_pte_range(), then
> it would be a little more efficient for that case.
>
> As for whether it's worth it, *probably* this is too small an effect to measure.
> But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> that Jan Kara and Tom Talpey helped me come up with, for related testing:
>
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
>
Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
will try to produce some results.

Thanks,
  Pingfan



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-07  6:10         ` Pingfan Liu
  (?)
@ 2019-06-11 12:29         ` Pingfan Liu
  2019-06-11 13:52           ` Christoph Hellwig
  2019-06-11 16:47           ` Ira Weiny
  -1 siblings, 2 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-11 12:29 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, linux-mm, Ira Weiny, Mike Rapoport, Dan Williams,
	Matthew Wilcox, Aneesh Kumar K.V, Keith Busch, Christoph Hellwig,
	LKML

[-- Attachment #1: Type: text/plain, Size: 4028 bytes --]

On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > ...
> > >>> --- a/mm/gup.c
> > >>> +++ b/mm/gup.c
> > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > >>>       return ret;
> > >>>  }
> > >>>
> > >>> +#ifdef CONFIG_CMA
> > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > >>> +{
> > >>> +     int i;
> > >>> +
> > >>> +     for (i = 0; i < nr_pinned; i++)
> > >>> +             if (is_migrate_cma_page(pages[i])) {
> > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > >>> +                     return i;
> > >>> +             }
> > >>> +
> > >>> +     return nr_pinned;
> > >>> +}
> > >>
> > >> There's no point in inlining this.
> > > OK, will drop it in V4.
> > >
> > >>
> > >> The code seems inefficient.  If it encounters a single CMA page it can
> > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > The trick is that the pages are not discarded; in fact, they are still
> > > referenced by the PTEs. We just leave the slow path to pick up the
> > > non-CMA pages again.
> > >
> > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > >> checking for a CMA page within gup_pte_range()?
> > > It would spread the same logic to the hugetlb pte and normal pte paths,
> > > with no performance improvement since we fall back to the slow path
> > > anyway. So I think it is not worth it.
> > >
> > >>
> >
> > I think the concern is: for the successful gup_fast case with no CMA
> > pages, this patch is adding another complete loop through all the
> > pages. In the fast case.
> >
> > If the check were instead done as part of the gup_pte_range(), then
> > it would be a little more efficient for that case.
> >
> > As for whether it's worth it, *probably* this is too small an effect to measure.
> > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> >
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> >
I was unable to get an NVMe device for testing. And when running fio on a
traditional disk, I got the error "fio: engine libaio not loadable
fio: failed to load engine
fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"

But I found a test case which can be slightly adjusted to meet the aim:
tools/testing/selftests/vm/gup_benchmark.c

Test environment:
  MemTotal:       264079324 kB
  MemFree:        262306788 kB
  CmaTotal:              0 kB
  CmaFree:               0 kB
  on AMD EPYC 7601

Test command:
  gup_benchmark -r 100 -n 64
  gup_benchmark -r 100 -n 64 -l
where -r is the repeat count, -n is the nr_pages parameter for
get_user_pages_fast(), and -l is a new option to test FOLL_LONGTERM in
the fast path (see the patch at the tail).

Test result:
w/o     477.800000
w/o-l   481.070000
a       481.800000
a-l     640.410000
b       466.240000  (question a: b outperforms w/o ?)
b-l     529.740000

Where w/o is the baseline (v5.2-rc2 without any patch), a is this series,
and b does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.

I am surprised that b-l shows about a 17% improvement over a-l:
(640.41 - 529.74)/640.41.
As for "question a: b outperforms w/o ?", I cannot figure out why; maybe
it can be considered as variance.

Based on the above result, I think it is better to do the check inside
gup_pte_range().

Any comment?

Thanks,


> Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> will try to bring out the result.
> 
> Thanks,
>   Pingfan
> 


[-- Attachment #2: gup_pte_range_check.patch --]
[-- Type: text/plain, Size: 1502 bytes --]

---
Patch to do check inside gup_pte_range()

diff --git a/mm/gup.c b/mm/gup.c
index 2ce3091..ba213a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (unlikely(flags & FOLL_LONGTERM) &&
+			is_migrate_cma_page(page))
+				goto pte_unmap;
+
 		head = try_get_compound_head(page, 1);
 		if (!head)
 			goto pte_unmap;
@@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pmd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pud_page(orig), refs);
 	if (!head) {
 		*nr -= refs;
@@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
+	if (unlikely(flags & FOLL_LONGTERM) &&
+		is_migrate_cma_page(page)) {
+		*nr -= refs;
+		return 0;
+	}
+
 	head = try_get_compound_head(pgd_page(orig), refs);
 	if (!head) {
 		*nr -= refs;

[-- Attachment #3: mm-gup-introduce-LONGTERM_BENCHMARK-in-fast-path.patch --]
[-- Type: text/plain, Size: 2525 bytes --]

---
Patch for testing

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d..61dec5f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,8 +6,9 @@
 #include <linux/debugfs.h>
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 						 pages + i);
 			break;
+		case GUP_FAST_LONGTERM_BENCHMARK:
+			nr = get_user_pages_fast(addr, nr,
+						 (gup->flags & 1) | FOLL_LONGTERM,
+						 pages + i);
+			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
 					    (gup->flags & 1) | FOLL_LONGTERM,
@@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
+	case GUP_FAST_LONGTERM_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 		break;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c0534e2..ade8acb 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,8 +15,9 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -37,7 +38,7 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
 		switch (opt) {
 		case 'm':
 			size = atoi(optarg) * MB;
@@ -54,6 +55,9 @@ int main(int argc, char **argv)
 		case 'T':
 			thp = 0;
 			break;
+		case 'l':
+			cmd = GUP_FAST_LONGTERM_BENCHMARK;
+			break;
 		case 'L':
 			cmd = GUP_LONGTERM_BENCHMARK;
 			break;
-- 
2.7.5



* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 12:29         ` Pingfan Liu
@ 2019-06-11 13:52           ` Christoph Hellwig
  2019-06-11 19:49             ` John Hubbard
  2019-06-11 16:47           ` Ira Weiny
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-06-11 13:52 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: John Hubbard, Andrew Morton, linux-mm, Ira Weiny, Mike Rapoport,
	Dan Williams, Matthew Wilcox, Aneesh Kumar K.V, Keith Busch,
	Christoph Hellwig, LKML

On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> I was unable to get an NVMe device for testing. And when testing fio on the

How would an NVMe test help?  FOLL_LONGTERM isn't used by any performance
critical path to start with, so I don't see how this patch could be
a problem.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-05  9:10 [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast() Pingfan Liu
  2019-06-05  9:10 ` [PATCHv3 2/2] mm/gup: rename nr as nr_pinned " Pingfan Liu
  2019-06-05 21:49 ` [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM " Andrew Morton
@ 2019-06-11 16:15 ` Aneesh Kumar K.V
  2019-06-11 16:29   ` Weiny, Ira
  2 siblings, 1 reply; 20+ messages in thread
From: Aneesh Kumar K.V @ 2019-06-11 16:15 UTC (permalink / raw)
  To: Pingfan Liu, linux-mm
  Cc: Pingfan Liu, Ira Weiny, Andrew Morton, Mike Rapoport,
	Dan Williams, Matthew Wilcox, John Hubbard, Keith Busch,
	Christoph Hellwig, linux-kernel

Pingfan Liu <kernelfans@gmail.com> writes:

> As for FOLL_LONGTERM, it is checked in the slow path
> __gup_longterm_unlocked(). But it is not checked in the fast path, which
> means a possible leak of CMA page to longterm pinned requirement through
> this crack.

Shouldn't we disallow FOLL_LONGTERM with the get_user_pages fast path? W.r.t.
the DAX check, we need the vma to determine whether a long-term pin is allowed.
If FOLL_LONGTERM is specified we should fall back to the slow path.

-aneesh


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 16:15 ` Aneesh Kumar K.V
@ 2019-06-11 16:29   ` Weiny, Ira
  2019-06-12 13:54     ` Pingfan Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Weiny, Ira @ 2019-06-11 16:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Pingfan Liu, linux-mm
  Cc: Pingfan Liu, Andrew Morton, Mike Rapoport, Williams, Dan J,
	Matthew Wilcox, John Hubbard, Busch, Keith, Christoph Hellwig,
	linux-kernel

> Pingfan Liu <kernelfans@gmail.com> writes:
> 
> > As for FOLL_LONGTERM, it is checked in the slow path
> > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > which means a possible leak of CMA page to longterm pinned requirement
> > through this crack.
> 
> > Shouldn't we disallow FOLL_LONGTERM with the get_user_pages fast path? W.r.t.
> > the DAX check, we need the vma to determine whether a long-term pin is allowed.
> > If FOLL_LONGTERM is specified we should fall back to the slow path.

Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it
does this while walking the page tables.  I missed the CMA case and Pingfan's
patch fixes this.  We could check for CMA pages while walking the page tables
but most agreed that it was not worth it.  For DAX we already had checks for
*_devmap() so it was easier to put the FOLL_LONGTERM checks there.

Ira


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 12:29         ` Pingfan Liu
  2019-06-11 13:52           ` Christoph Hellwig
@ 2019-06-11 16:47           ` Ira Weiny
  2019-06-12 14:10               ` Pingfan Liu
  1 sibling, 1 reply; 20+ messages in thread
From: Ira Weiny @ 2019-06-11 16:47 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: John Hubbard, Andrew Morton, linux-mm, Mike Rapoport,
	Dan Williams, Matthew Wilcox, Aneesh Kumar K.V, Keith Busch,
	Christoph Hellwig, LKML

On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > ...
> > > >>> --- a/mm/gup.c
> > > >>> +++ b/mm/gup.c
> > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > > >>>       return ret;
> > > >>>  }
> > > >>>
> > > >>> +#ifdef CONFIG_CMA
> > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > > >>> +{
> > > >>> +     int i;
> > > >>> +
> > > >>> +     for (i = 0; i < nr_pinned; i++)
> > > >>> +             if (is_migrate_cma_page(pages[i])) {
> > > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > > >>> +                     return i;
> > > >>> +             }
> > > >>> +
> > > >>> +     return nr_pinned;
> > > >>> +}
> > > >>
> > > >> There's no point in inlining this.
> > > > OK, will drop it in V4.
> > > >
> > > >>
> > > >> The code seems inefficient.  If it encounters a single CMA page it can
> > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > The trick is that the pages are not discarded; in fact, they are still
> > > > referenced by the pte. We just leave the slow path to pick up the non-CMA
> > > > pages again.
> > > >
> > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > > >> checking for a CMA page within gup_pte_range()?
> > > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > > improvement in performance due to the slow path. So I think maybe it is
> > > > not worth it.
> > > >
> > > >>
> > >
> > > I think the concern is: for the successful gup_fast case with no CMA
> > > pages, this patch is adding another complete loop through all the
> > > pages. In the fast case.
> > >
> > > If the check were instead done as part of the gup_pte_range(), then
> > > it would be a little more efficient for that case.
> > >
> > > As for whether it's worth it, *probably* this is too small an effect to measure.
> > > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> > >
> > > [reader]
> > > direct=1
> > > ioengine=libaio
> > > blocksize=4096
> > > size=1g
> > > numjobs=1
> > > rw=read
> > > iodepth=64
> > >
> I was unable to get an NVMe device for testing. And when testing fio on a
> traditional disk, I got the error "fio: engine libaio not loadable
> fio: failed to load engine
> fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"
> 
> But I found a test case which can be slightly adjusted to meet the aim.
> It is tools/testing/selftests/vm/gup_benchmark.c
> 
> Test environment:
>   MemTotal:       264079324 kB
>   MemFree:        262306788 kB
>   CmaTotal:              0 kB
>   CmaFree:               0 kB
>   on AMD EPYC 7601
> 
> Test command:
>   gup_benchmark -r 100 -n 64
>   gup_benchmark -r 100 -n 64 -l
> where -r stands for repeat times, -n is the nr_pages param for
> get_user_pages_fast(), and -l is a new option to test FOLL_LONGTERM in the
> fast path; see the patch at the tail.

Thanks!  That is a good test to add.  You should add the patch to the series.

> 
> Test result:
> w/o     477.800000
> w/o-l   481.070000
> a       481.800000
> a-l     640.410000
> b       466.240000  (question a: b outperforms w/o ?)
> b-l     529.740000
> 
> Where w/o is the baseline without any patch on v5.2-rc2, a is this series,
> and b does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.
> 
> I am surprised that b-l shows about a 17% improvement over a-l: (640.41 - 529.74)/640.41

Wow that is bigger than I would have thought.  I suspect it gets worse as -n
increases?

>
> As for "question a: b outperforms w/o ?", I can not figure out why, maybe it can be
> considered as variance.

:-/

Does this change with larger -r or -n values?

> 
> Based on the above result, I think it is better to do the check inside
> gup_pte_range().
> 
> Any comment?

I agree.

Ira

> 
> Thanks,
> 
> 
> > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> > will try to bring out the result.
> > 
> > Thanks,
> >   Pingfan
> > 
> 

> ---
> Patch to do check inside gup_pte_range()
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 2ce3091..ba213a0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  
> +		if (unlikely(flags & FOLL_LONGTERM) &&
> +			is_migrate_cma_page(page))
> +				goto pte_unmap;
> +
>  		head = try_get_compound_head(page, 1);
>  		if (!head)
>  			goto pte_unmap;
> @@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pmd_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;
> @@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pud_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;
> @@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  		refs++;
>  	} while (addr += PAGE_SIZE, addr != end);
>  
> +	if (unlikely(flags & FOLL_LONGTERM) &&
> +		is_migrate_cma_page(page)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
>  	head = try_get_compound_head(pgd_page(orig), refs);
>  	if (!head) {
>  		*nr -= refs;

> ---
> Patch for testing
> 
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d..61dec5f 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -6,8 +6,9 @@
>  #include <linux/debugfs.h>
>  
>  #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
>  
>  struct gup_benchmark {
>  	__u64 get_delta_usec;
> @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>  			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
>  						 pages + i);
>  			break;
> +		case GUP_FAST_LONGTERM_BENCHMARK:
> +			nr = get_user_pages_fast(addr, nr,
> +						 (gup->flags & 1) | FOLL_LONGTERM,
> +						 pages + i);
> +			break;
>  		case GUP_LONGTERM_BENCHMARK:
>  			nr = get_user_pages(addr, nr,
>  					    (gup->flags & 1) | FOLL_LONGTERM,
> @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
>  
>  	switch (cmd) {
>  	case GUP_FAST_BENCHMARK:
> +	case GUP_FAST_LONGTERM_BENCHMARK:
>  	case GUP_LONGTERM_BENCHMARK:
>  	case GUP_BENCHMARK:
>  		break;
> diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> index c0534e2..ade8acb 100644
> --- a/tools/testing/selftests/vm/gup_benchmark.c
> +++ b/tools/testing/selftests/vm/gup_benchmark.c
> @@ -15,8 +15,9 @@
>  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
>  
>  #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
> -#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> -#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_FAST_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
> +#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
> +#define GUP_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
>  
>  struct gup_benchmark {
>  	__u64 get_delta_usec;
> @@ -37,7 +38,7 @@ int main(int argc, char **argv)
>  	char *file = "/dev/zero";
>  	char *p;
>  
> -	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> +	while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
>  		switch (opt) {
>  		case 'm':
>  			size = atoi(optarg) * MB;
> @@ -54,6 +55,9 @@ int main(int argc, char **argv)
>  		case 'T':
>  			thp = 0;
>  			break;
> +		case 'l':
> +			cmd = GUP_FAST_LONGTERM_BENCHMARK;
> +			break;
>  		case 'L':
>  			cmd = GUP_LONGTERM_BENCHMARK;
>  			break;
> -- 
> 2.7.5
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 13:52           ` Christoph Hellwig
@ 2019-06-11 19:49             ` John Hubbard
  0 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2019-06-11 19:49 UTC (permalink / raw)
  To: Christoph Hellwig, Pingfan Liu
  Cc: Andrew Morton, linux-mm, Ira Weiny, Mike Rapoport, Dan Williams,
	Matthew Wilcox, Aneesh Kumar K.V, Keith Busch, LKML

On 6/11/19 6:52 AM, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
>> I was unable to get an NVMe device for testing. And when testing fio on the
> 
> How would an NVMe test help?  FOLL_LONGTERM isn't used by any performance
> critical path to start with, so I don't see how this patch could be
> a problem.
> 

yes, you're right of course. We skip the loop entirely for FOLL_LONGTERM,
and I forgot for the moment that the direct IO paths are never going to
set that flag. :)

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 16:29   ` Weiny, Ira
@ 2019-06-12 13:54     ` Pingfan Liu
  2019-06-12 23:50       ` Ira Weiny
  0 siblings, 1 reply; 20+ messages in thread
From: Pingfan Liu @ 2019-06-12 13:54 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Aneesh Kumar K.V, linux-mm, Andrew Morton, Mike Rapoport,
	Williams, Dan J, Matthew Wilcox, John Hubbard, Busch, Keith,
	Christoph Hellwig, linux-kernel

On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > Pingfan Liu <kernelfans@gmail.com> writes:
> > 
> > > As for FOLL_LONGTERM, it is checked in the slow path
> > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > which means a possible leak of CMA page to longterm pinned requirement
> > > through this crack.
> > 
> > Shouldn't we disallow FOLL_LONGTERM with the get_user_pages fast path? W.r.t.
> > the DAX check, we need the vma to determine whether a long-term pin is allowed.
> > If FOLL_LONGTERM is specified we should fall back to the slow path.
> 
> Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it
> does this while walking the page tables.  I missed the CMA case and Pingfan's
> patch fixes this.  We could check for CMA pages while walking the page tables
> but most agreed that it was not worth it.  For DAX we already had checks for
> *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> 
Then for CMA pages, are you suggesting something like:
diff --git a/mm/gup.c b/mm/gup.c
index 42a47c0..8bf3cc3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
        if (unlikely(!access_ok((void __user *)start, len)))
                return -EFAULT;

+       if (unlikely(gup_flags & FOLL_LONGTERM))
+               goto slow;
        if (gup_fast_permitted(start, nr_pages)) {
                local_irq_disable();
                gup_pgd_range(addr, end, gup_flags, pages, &nr);
@@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
                ret = nr;
        }

+slow:
        if (nr < nr_pages) {
                /* Try to get the remaining pages with get_user_pages */
                start += nr << PAGE_SHIFT;

Thanks,
  Pingfan

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-11 16:47           ` Ira Weiny
@ 2019-06-12 14:10               ` Pingfan Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-12 14:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, Andrew Morton, linux-mm, Mike Rapoport,
	Dan Williams, Matthew Wilcox, Aneesh Kumar K.V, Keith Busch,
	Christoph Hellwig, LKML

On Wed, Jun 12, 2019 at 12:46 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >
> > > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > ...
> > > > >>> --- a/mm/gup.c
> > > > >>> +++ b/mm/gup.c
> > > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > > > >>>       return ret;
> > > > >>>  }
> > > > >>>
> > > > >>> +#ifdef CONFIG_CMA
> > > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > > > >>> +{
> > > > >>> +     int i;
> > > > >>> +
> > > > >>> +     for (i = 0; i < nr_pinned; i++)
> > > > >>> +             if (is_migrate_cma_page(pages[i])) {
> > > > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > > > >>> +                     return i;
> > > > >>> +             }
> > > > >>> +
> > > > >>> +     return nr_pinned;
> > > > >>> +}
> > > > >>
> > > > >> There's no point in inlining this.
> > > > > OK, will drop it in V4.
> > > > >
> > > > >>
> > > > >> The code seems inefficient.  If it encounters a single CMA page it can
> > > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > > The trick is that the pages are not discarded; in fact, they are still
> > > > > referenced by the pte. We just leave the slow path to pick up the non-CMA
> > > > > pages again.
> > > > >
> > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > > > >> checking for a CMA page within gup_pte_range()?
> > > > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > > > improvement in performance due to the slow path. So I think maybe it is
> > > > > not worth it.
> > > > >
> > > > >>
> > > >
> > > > I think the concern is: for the successful gup_fast case with no CMA
> > > > pages, this patch is adding another complete loop through all the
> > > > pages. In the fast case.
> > > >
> > > > If the check were instead done as part of the gup_pte_range(), then
> > > > it would be a little more efficient for that case.
> > > >
> > > > As for whether it's worth it, *probably* this is too small an effect to measure.
> > > > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > > > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> > > >
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > >
> > I was unable to get an NVMe device for testing. And when testing fio on a
> > traditional disk, I got the error "fio: engine libaio not loadable
> > fio: failed to load engine
> > fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"
> >
> > But I found a test case which can be slightly adjusted to meet the aim.
> > It is tools/testing/selftests/vm/gup_benchmark.c
> >
> > Test environment:
> >   MemTotal:       264079324 kB
> >   MemFree:        262306788 kB
> >   CmaTotal:              0 kB
> >   CmaFree:               0 kB
> >   on AMD EPYC 7601
> >
> > Test command:
> >   gup_benchmark -r 100 -n 64
> >   gup_benchmark -r 100 -n 64 -l
> > where -r stands for repeat times, -n is the nr_pages param for
> > get_user_pages_fast(), and -l is a new option to test FOLL_LONGTERM in the
> > fast path; see the patch at the tail.
>
> Thanks!  That is a good test to add.  You should add the patch to the series.
OK.
>
> >
> > Test result:
> > w/o     477.800000
> > w/o-l   481.070000
> > a       481.800000
> > a-l     640.410000
> > b       466.240000  (question a: b outperforms w/o ?)
> > b-l     529.740000
> >
> > Where w/o is the baseline without any patch on v5.2-rc2, a is this series,
> > and b does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.
> >
> > I am surprised that b-l shows about a 17% improvement over a-l: (640.41 - 529.74)/640.41
>
> Wow that is bigger than I would have thought.  I suspect it gets worse as -n
> increases?
Yes. I test with -n 64/128/256/512. It has this trend. See the data below.

>
> >
> > As for "question a: b outperforms w/o ?", I can not figure out why, maybe it can be
> > considered as variance.
>
> :-/
>
> Does this change with larger -r or -n values?
-r should have no effect on this. And I changed -n to 64/128/256/512. The
data always shows b outperforming w/o a bit.

      64        128         256        512
a-l  633.23   676.83  747.14  683.19    (the n=256 run was likely disturbed
by something, but the overall trend keeps going up)
b-l  528.32   529.10  523.95  512.88
w/o  479.73   473.87  477.67  488.70
b    470.13   467.11  463.06  469.62

Thanks,
  Pingfan
>
> >
> > Based on the above result, I think it is better to do the check inside
> > gup_pte_range().
> >
> > Any comment?
>
> I agree.
>
> Ira
>
> >
> > Thanks,
> >
> >
> > > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> > > will try to bring out the result.
> > >
> > > Thanks,
> > >   Pingfan
> > >
> >
>
> > ---
> > Patch to do check inside gup_pte_range()
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 2ce3091..ba213a0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >               VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> >               page = pte_page(pte);
> >
> > +             if (unlikely(flags & FOLL_LONGTERM) &&
> > +                     is_migrate_cma_page(page))
> > +                             goto pte_unmap;
> > +
> >               head = try_get_compound_head(page, 1);
> >               if (!head)
> >                       goto pte_unmap;
> > @@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pmd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pud_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pgd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
>
> > ---
> > Patch for testing
> >
> > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > index 7dd602d..61dec5f 100644
> > --- a/mm/gup_benchmark.c
> > +++ b/mm/gup_benchmark.c
> > @@ -6,8 +6,9 @@
> >  #include <linux/debugfs.h>
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> >                       nr = get_user_pages_fast(addr, nr, gup->flags & 1,
> >                                                pages + i);
> >                       break;
> > +             case GUP_FAST_LONGTERM_BENCHMARK:
> > +                     nr = get_user_pages_fast(addr, nr,
> > +                                              (gup->flags & 1) | FOLL_LONGTERM,
> > +                                              pages + i);
> > +                     break;
> >               case GUP_LONGTERM_BENCHMARK:
> >                       nr = get_user_pages(addr, nr,
> >                                           (gup->flags & 1) | FOLL_LONGTERM,
> > @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
> >
> >       switch (cmd) {
> >       case GUP_FAST_BENCHMARK:
> > +     case GUP_FAST_LONGTERM_BENCHMARK:
> >       case GUP_LONGTERM_BENCHMARK:
> >       case GUP_BENCHMARK:
> >               break;
> > diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> > index c0534e2..ade8acb 100644
> > --- a/tools/testing/selftests/vm/gup_benchmark.c
> > +++ b/tools/testing/selftests/vm/gup_benchmark.c
> > @@ -15,8 +15,9 @@
> >  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -37,7 +38,7 @@ int main(int argc, char **argv)
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
> >               switch (opt) {
> >               case 'm':
> >                       size = atoi(optarg) * MB;
> > @@ -54,6 +55,9 @@ int main(int argc, char **argv)
> >               case 'T':
> >                       thp = 0;
> >                       break;
> > +             case 'l':
> > +                     cmd = GUP_FAST_LONGTERM_BENCHMARK;
> > +                     break;
> >               case 'L':
> >                       cmd = GUP_LONGTERM_BENCHMARK;
> >                       break;
> > --
> > 2.7.5
> >
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
@ 2019-06-12 14:10               ` Pingfan Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-12 14:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: John Hubbard, Andrew Morton, linux-mm, Mike Rapoport,
	Dan Williams, Matthew Wilcox, Aneesh Kumar K.V, Keith Busch,
	Christoph Hellwig, LKML

On Wed, Jun 12, 2019 at 12:46 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Tue, Jun 11, 2019 at 08:29:35PM +0800, Pingfan Liu wrote:
> > On Fri, Jun 07, 2019 at 02:10:15PM +0800, Pingfan Liu wrote:
> > > On Fri, Jun 7, 2019 at 5:17 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > > >
> > > > On 6/5/19 7:19 PM, Pingfan Liu wrote:
> > > > > On Thu, Jun 6, 2019 at 5:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > ...
> > > > >>> --- a/mm/gup.c
> > > > >>> +++ b/mm/gup.c
> > > > >>> @@ -2196,6 +2196,26 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> > > > >>>       return ret;
> > > > >>>  }
> > > > >>>
> > > > >>> +#ifdef CONFIG_CMA
> > > > >>> +static inline int reject_cma_pages(int nr_pinned, struct page **pages)
> > > > >>> +{
> > > > >>> +     int i;
> > > > >>> +
> > > > >>> +     for (i = 0; i < nr_pinned; i++)
> > > > >>> +             if (is_migrate_cma_page(pages[i])) {
> > > > >>> +                     put_user_pages(pages + i, nr_pinned - i);
> > > > >>> +                     return i;
> > > > >>> +             }
> > > > >>> +
> > > > >>> +     return nr_pinned;
> > > > >>> +}
> > > > >>
> > > > >> There's no point in inlining this.
> > > > > OK, will drop it in V4.
> > > > >
> > > > >>
> > > > >> The code seems inefficient.  If it encounters a single CMA page it can
> > > > >> end up discarding a possibly significant number of non-CMA pages.  I
> > > > > The trick is the page is not be discarded, in fact, they are still be
> > > > > referrenced by pte. We just leave the slow path to pick up the non-CMA
> > > > > pages again.
> > > > >
> > > > >> guess that doesn't matter much, as get_user_pages(FOLL_LONGTERM) is
> > > > >> rare.  But could we avoid this (and the second pass across pages[]) by
> > > > >> checking for a CMA page within gup_pte_range()?
> > > > > It will spread the same logic to hugetlb pte and normal pte. And no
> > > > > improvement in performance due to slow path. So I think maybe it is
> > > > > not worth.
> > > > >
> > > > >>
> > > >
> > > > I think the concern is: for the successful gup_fast case with no CMA
> > > > pages, this patch is adding another complete loop through all the
> > > > pages. In the fast case.
> > > >
> > > > If the check were instead done as part of the gup_pte_range(), then
> > > > it would be a little more efficient for that case.
> > > >
> > > > As for whether it's worth it, *probably* this is too small an effect to measure.
> > > > But in order to attempt a measurement: running fio (https://github.com/axboe/fio)
> > > > with O_DIRECT on an NVMe drive, might shed some light. Here's an fio.conf file
> > > > that Jan Kara and Tom Talpey helped me come up with, for related testing:
> > > >
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > >
> > Unable to get a NVME device to have a test. And when testing fio on the
> > tranditional disk, I got the error "fio: engine libaio not loadable
> > fio: failed to load engine
> > fio: file:ioengines.c:89, func=dlopen, error=libaio: cannot open shared object file: No such file or directory"
> >
> > But I found a test case which can be slightly adjusted to met the aim.
> > It is tools/testing/selftests/vm/gup_benchmark.c
> >
> > Test enviroment:
> >   MemTotal:       264079324 kB
> >   MemFree:        262306788 kB
> >   CmaTotal:              0 kB
> >   CmaFree:               0 kB
> >   on AMD EPYC 7601
> >
> > Test command:
> >   gup_benchmark -r 100 -n 64
> >   gup_benchmark -r 100 -n 64 -l
> > where -r stands for repeat times, -n is nr_pages param for
> > get_user_pages_fast(), -l is a new option to test FOLL_LONGTERM in fast
> > path, see a patch at the tail.
>
> Thanks!  That is a good test to add.  You should add the patch to the series.
OK.
>
> >
> > Test result:
> > w/o     477.800000
> > w/o-l   481.070000
> > a       481.800000
> > a-l     640.410000
> > b       466.240000  (question a: b outperforms w/o ?)
> > b-l     529.740000
> >
> > Where w/o is the baseline (v5.2-rc2 without any patch), a is this series, and
> > b does the check in gup_pte_range(). '-l' means FOLL_LONGTERM.
> >
> > I am surprised that b-l has about a 17% improvement over a-l: (640.41 - 529.74)/640.41
>
> Wow that is bigger than I would have thought.  I suspect it gets worse as -n
> increases?
Yes. I tested with -n 64/128/256/512, and it shows this trend. See the data below.

>
> >
> > As for "question a: b outperforms w/o ?", I can not figure out why, maybe it can be
> > considered as variance.
>
> :-/
>
> Does this change with larger -r or -n values?
-r should have no effect on this. And I changed -n to 64/128/256/512; the
data always shows b outperforming w/o by a bit.

      64       128      256      512
a-l  633.23   676.83   747.14   683.19   (n=256 should be disturbed by something,
                                          but the overall trend keeps going up)
b-l  528.32   529.10   523.95   512.88
w/o  479.73   473.87   477.67   488.70
b    470.13   467.11   463.06   469.62
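As a sanity check on the table above (a standalone sketch, not part of the kernel
patches; the row names follow the legend earlier in the thread, with "a-l" being
this series plus FOLL_LONGTERM and "b-l" the in-walk check plus FOLL_LONGTERM),
the relative gain of doing the CMA check inside gup_pte_range() can be recomputed:

```python
# Per-call times in microseconds, copied from the gup_benchmark table above.
a_l = {64: 633.23, 128: 676.83, 256: 747.14, 512: 683.19}
b_l = {64: 528.32, 128: 529.10, 256: 523.95, 512: 512.88}

# Relative gain of the in-walk check (b-l) over the extra-loop check (a-l).
gain = {n: (a_l[n] - b_l[n]) / a_l[n] for n in a_l}
for n in sorted(gain):
    print(f"-n {n}: b-l is {gain[n]:.1%} faster than a-l")
```

For -n 64 this gives roughly the ~17% figure quoted from the first run, within
run-to-run variance, and the gap widens as -n grows.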

Thanks,
  Pingfan
>
> >
> > Based on the above result, I think it is better to do the check inside
> > gup_pte_range().
> >
> > Any comment?
>
> I agree.
>
> Ira
>
> >
> > Thanks,
> >
> >
> > > Yeah, agreed. Data is more persuasive. Thanks for your suggestion. I
> > > will try to bring out the result.
> > >
> > > Thanks,
> > >   Pingfan
> > >
> >
>
> > ---
> > Patch to do check inside gup_pte_range()
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 2ce3091..ba213a0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1757,6 +1757,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >               VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> >               page = pte_page(pte);
> >
> > +             if (unlikely(flags & FOLL_LONGTERM) &&
> > +                     is_migrate_cma_page(page))
> > +                             goto pte_unmap;
> > +
> >               head = try_get_compound_head(page, 1);
> >               if (!head)
> >                       goto pte_unmap;
> > @@ -1900,6 +1904,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pmd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1941,6 +1951,12 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pud_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
> > @@ -1978,6 +1994,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
> >               refs++;
> >       } while (addr += PAGE_SIZE, addr != end);
> >
> > +     if (unlikely(flags & FOLL_LONGTERM) &&
> > +             is_migrate_cma_page(page)) {
> > +             *nr -= refs;
> > +             return 0;
> > +     }
> > +
> >       head = try_get_compound_head(pgd_page(orig), refs);
> >       if (!head) {
> >               *nr -= refs;
>
> > ---
> > Patch for testing
> >
> > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > index 7dd602d..61dec5f 100644
> > --- a/mm/gup_benchmark.c
> > +++ b/mm/gup_benchmark.c
> > @@ -6,8 +6,9 @@
> >  #include <linux/debugfs.h>
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -53,6 +54,11 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> >                       nr = get_user_pages_fast(addr, nr, gup->flags & 1,
> >                                                pages + i);
> >                       break;
> > +             case GUP_FAST_LONGTERM_BENCHMARK:
> > +                     nr = get_user_pages_fast(addr, nr,
> > +                                              (gup->flags & 1) | FOLL_LONGTERM,
> > +                                              pages + i);
> > +                     break;
> >               case GUP_LONGTERM_BENCHMARK:
> >                       nr = get_user_pages(addr, nr,
> >                                           (gup->flags & 1) | FOLL_LONGTERM,
> > @@ -96,6 +102,7 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
> >
> >       switch (cmd) {
> >       case GUP_FAST_BENCHMARK:
> > +     case GUP_FAST_LONGTERM_BENCHMARK:
> >       case GUP_LONGTERM_BENCHMARK:
> >       case GUP_BENCHMARK:
> >               break;
> > diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
> > index c0534e2..ade8acb 100644
> > --- a/tools/testing/selftests/vm/gup_benchmark.c
> > +++ b/tools/testing/selftests/vm/gup_benchmark.c
> > @@ -15,8 +15,9 @@
> >  #define PAGE_SIZE sysconf(_SC_PAGESIZE)
> >
> >  #define GUP_FAST_BENCHMARK   _IOWR('g', 1, struct gup_benchmark)
> > -#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 2, struct gup_benchmark)
> > -#define GUP_BENCHMARK                _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_FAST_LONGTERM_BENCHMARK  _IOWR('g', 2, struct gup_benchmark)
> > +#define GUP_LONGTERM_BENCHMARK       _IOWR('g', 3, struct gup_benchmark)
> > +#define GUP_BENCHMARK                _IOWR('g', 4, struct gup_benchmark)
> >
> >  struct gup_benchmark {
> >       __u64 get_delta_usec;
> > @@ -37,7 +38,7 @@ int main(int argc, char **argv)
> >       char *file = "/dev/zero";
> >       char *p;
> >
> > -     while ((opt = getopt(argc, argv, "m:r:n:f:tTLUSH")) != -1) {
> > +     while ((opt = getopt(argc, argv, "m:r:n:f:tTlLUSH")) != -1) {
> >               switch (opt) {
> >               case 'm':
> >                       size = atoi(optarg) * MB;
> > @@ -54,6 +55,9 @@ int main(int argc, char **argv)
> >               case 'T':
> >                       thp = 0;
> >                       break;
> > +             case 'l':
> > +                     cmd = GUP_FAST_LONGTERM_BENCHMARK;
> > +                     break;
> >               case 'L':
> >                       cmd = GUP_LONGTERM_BENCHMARK;
> >                       break;
> > --
> > 2.7.5
> >
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-12 13:54     ` Pingfan Liu
@ 2019-06-12 23:50       ` Ira Weiny
  2019-06-13 10:48           ` Pingfan Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Ira Weiny @ 2019-06-12 23:50 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: Aneesh Kumar K.V, linux-mm, Andrew Morton, Mike Rapoport,
	Williams, Dan J, Matthew Wilcox, John Hubbard, Busch, Keith,
	Christoph Hellwig, linux-kernel

On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > > Pingfan Liu <kernelfans@gmail.com> writes:
> > > 
> > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > through this crack.
> > > 
> > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > dax check we need vma to ensure whether a long term pin is allowed or not.
> > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > 
> > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it does this while walking the page tables.  I missed the CMA case and Pingfan's patch fixes this.  We could check for CMA pages while walking the page tables but most agreed that it was not worth it.  For DAX we already had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> > 
> Then for CMA pages, are you suggesting something like:

I'm not suggesting this.

Sorry I wrote this prior to seeing the numbers in your other email.  Given
the numbers it looks like performing the check whilst walking the tables is
worth the extra complexity.  I was just trying to summarize the thread.  I
don't think we should disallow FOLL_LONGTERM because it only affects CMA and
DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
we don't have to.  Also in the case of DAX the use of vma will be going
away...[1]  Eventually...  ;-)

Ira

[1] https://lkml.org/lkml/2019/6/5/1049

> diff --git a/mm/gup.c b/mm/gup.c
> index 42a47c0..8bf3cc3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2251,6 +2251,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>         if (unlikely(!access_ok((void __user *)start, len)))
>                 return -EFAULT;
> 
> +       if (unlikely(gup_flags & FOLL_LONGTERM))
> +               goto slow;
>         if (gup_fast_permitted(start, nr_pages)) {
>                 local_irq_disable();
>                 gup_pgd_range(addr, end, gup_flags, pages, &nr);
> @@ -2258,6 +2260,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>                 ret = nr;
>         }
> 
> +slow:
>         if (nr < nr_pages) {
>                 /* Try to get the remaining pages with get_user_pages */
>                 start += nr << PAGE_SHIFT;
> 
> Thanks,
>   Pingfan


* Re: [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()
  2019-06-12 23:50       ` Ira Weiny
@ 2019-06-13 10:48           ` Pingfan Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Pingfan Liu @ 2019-06-13 10:48 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Aneesh Kumar K.V, linux-mm, Andrew Morton, Mike Rapoport,
	Williams, Dan J, Matthew Wilcox, John Hubbard, Busch, Keith,
	Christoph Hellwig, linux-kernel

On Thu, Jun 13, 2019 at 7:49 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Wed, Jun 12, 2019 at 09:54:58PM +0800, Pingfan Liu wrote:
> > On Tue, Jun 11, 2019 at 04:29:11PM +0000, Weiny, Ira wrote:
> > > > Pingfan Liu <kernelfans@gmail.com> writes:
> > > >
> > > > > As for FOLL_LONGTERM, it is checked in the slow path
> > > > > __gup_longterm_unlocked(). But it is not checked in the fast path,
> > > > > which means a possible leak of CMA page to longterm pinned requirement
> > > > > through this crack.
> > > >
> > > > Shouldn't we disallow FOLL_LONGTERM with get_user_pages fastpath? W.r.t
> > > > dax check we need vma to ensure whether a long term pin is allowed or not.
> > > > If FOLL_LONGTERM is specified we should fallback to slow path.
> > >
> > > Yes, the fastpath bails to the slowpath if FOLL_LONGTERM _and_ DAX.  But it does this while walking the page tables.  I missed the CMA case and Pingfan's patch fixes this.  We could check for CMA pages while walking the page tables but most agreed that it was not worth it.  For DAX we already had checks for *_devmap() so it was easier to put the FOLL_LONGTERM checks there.
> > >
> > Then for CMA pages, are you suggesting something like:
>
> I'm not suggesting this.
OK, then I send out v4.
>
> Sorry I wrote this prior to seeing the numbers in your other email.  Given
> the numbers it looks like performing the check whilst walking the tables is
> worth the extra complexity.  I was just trying to summarize the thread.  I
> don't think we should disallow FOLL_LONGTERM because it only affects CMA and
> DAX.  Other pages will be fine with FOLL_LONGTERM.  Why penalize every call if
> we don't have to.  Also in the case of DAX the use of vma will be going
> away...[1]  Eventually...  ;-)
A good feature. Trying to catch up.

Thanks,
Pingfan



Thread overview: 20+ messages
2019-06-05  9:10 [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast() Pingfan Liu
2019-06-05  9:10 ` [PATCHv3 2/2] mm/gup: rename nr as nr_pinned " Pingfan Liu
2019-06-05 21:49 ` [PATCHv3 1/2] mm/gup: fix omission of check on FOLL_LONGTERM " Andrew Morton
2019-06-06  2:19   ` Pingfan Liu
2019-06-06  2:19     ` Pingfan Liu
2019-06-06 21:17     ` John Hubbard
2019-06-07  6:10       ` Pingfan Liu
2019-06-07  6:10         ` Pingfan Liu
2019-06-11 12:29         ` Pingfan Liu
2019-06-11 13:52           ` Christoph Hellwig
2019-06-11 19:49             ` John Hubbard
2019-06-11 16:47           ` Ira Weiny
2019-06-12 14:10             ` Pingfan Liu
2019-06-12 14:10               ` Pingfan Liu
2019-06-11 16:15 ` Aneesh Kumar K.V
2019-06-11 16:29   ` Weiny, Ira
2019-06-12 13:54     ` Pingfan Liu
2019-06-12 23:50       ` Ira Weiny
2019-06-13 10:48         ` Pingfan Liu
2019-06-13 10:48           ` Pingfan Liu
