From: Jerome Glisse <jglisse@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Dave Chinner" <david@fromorbit.com>,
	"David Airlie" <airlied@linux.ie>,
	"David S . Miller" <davem@davemloft.net>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>, "Jens Axboe" <axboe@kernel.dk>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
Date: Mon, 4 Nov 2019 18:49:20 -0500	[thread overview]
Message-ID: <20191104234920.GA18515@redhat.com> (raw)
In-Reply-To: <7821cf87-75a8-45e2-cf28-f85b62192416@nvidia.com>

On Mon, Nov 04, 2019 at 02:49:18PM -0800, John Hubbard wrote:
> On 11/4/19 10:52 AM, Jerome Glisse wrote:
> > On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> >> Add tracking of pages that were pinned via FOLL_PIN.
> >>
> >> As mentioned in the FOLL_PIN documentation, callers who effectively set
> >> FOLL_PIN are required to ultimately free such pages via put_user_page().
> >> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> >> for DIO and/or RDMA use".
> >>
> >> Pages that have been pinned via FOLL_PIN are identifiable via a
> >> new function call:
> >>
> >>    bool page_dma_pinned(struct page *page);
> >>
> >> What to do in response to encountering such a page is left to later
> >> patchsets. There is discussion about this in [1].
> >>
> >> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> >>
> >> This also has a couple of trivial, non-functional change fixes to
> >> try_get_compound_head(). That function got moved to the top of the
> >> file.
> > 
> > Maybe split that as a separate trivial patch.
> 
> 
> Will do.
> 
> 
> > 
> >>
> >> This includes the following fix from Ira Weiny:
> >>
> >> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> >> for GUP pages by introducing put_devmap_managed_user_page() which
> >> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> > 
> > Please do the put_devmap_managed_page() changes in a separate
> > patch; it would be a lot easier to follow. Also, on that front,
> > see the comments below.
> 
> 
> Oh! OK. It makes sense when you say it out loud. :)
> 
> 
> ...
> >> +static inline bool put_devmap_managed_page(struct page *page)
> >> +{
> >> +	bool is_devmap = page_is_devmap_managed(page);
> >> +
> >> +	if (is_devmap) {
> >> +		int count = page_ref_dec_return(page);
> >> +
> >> +		__put_devmap_managed_page(page, count);
> >> +	}
> >> +
> >> +	return is_devmap;
> >> +}
> > 
> > I think __put_devmap_managed_page() should be renamed
> > to free_devmap_managed_page(), and the count != 1
> > case moved to this inline function, ie:
> > 
> > static inline bool put_devmap_managed_page(struct page *page)
> > {
> > 	bool is_devmap = page_is_devmap_managed(page);
> > 
> > 	if (is_devmap) {
> > 		int count = page_ref_dec_return(page);
> > 
> > 		/*
> > 		 * If refcount is 1 then page is freed and refcount is stable as nobody
> > 		 * holds a reference on the page.
> > 		 */
> > 		if (count == 1)
> > 			free_devmap_managed_page(page, count);
> > 		else if (!count)
> > 			__put_page(page);
> > 	}
> > 
> > 	return is_devmap;
> > }
> > 
> 
> Thanks, that does look cleaner and easier to read.
> 
> > 
> >> +
> >>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> >>  static inline bool put_devmap_managed_page(struct page *page)
> >>  {
> >> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
> >>  	return true;
> >>  }
> >>  
> >> +__must_check bool user_page_ref_inc(struct page *page);
> >> +
> > 
> > What about having it as an inline here, as it is pretty small?
> 
> 
> You mean move it to a static inline function in mm.h? It's worse than it 
> looks, though: *everything* that it calls is also a static function, local
> to gup.c. So I'd have to expose both try_get_compound_head() and
> __update_proc_vmstat(). And that also means calling mod_node_page_state() from
> mm.h, and it goes south right about there. :)

Ok, fair enough.
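
[ For reference, user_page_ref_inc() in this patch is roughly the shape
  sketched below. This is a paraphrase for illustration only, not the
  exact code from the series; the vmstat item name and the helper's
  exact signature are assumptions. The point is simply that it leans on
  try_get_compound_head() and the vmstat helper, both local to gup.c:

	__must_check bool user_page_ref_inc(struct page *page)
	{
		/* Grab the compound head with the FOLL_PIN counting bias. */
		page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS);
		if (!page)
			return false;

		/* Account the pin via the gup.c-local vmstat helper. */
		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
		return true;
	}

  Making this a static inline in mm.h would mean exporting those
  helpers as well, which is the dependency chain described above. ]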

> ...  
> >> +/**
> >> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
> >> + * or pin_longterm_pages*()
> >> + * @page:	pointer to page to be queried.
> >> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> >> + *		False, if the page is definitely not dma-pinned.
> >> + */
> > 
> > Maybe add a small comment about wrap around :)
> 
> 
> I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().
> 
> But it's true that the documentation is a little light here... What did you have
> in mind?

About the false positive case (and how unlikely it is), and that wrap
around is properly handled. Maybe just a pointer to the documentation,
so that people know they can go look there for details. I know my
brain tends to forget where to look for things, so I like to be
constantly reminded: hey, the doc is Documentation/foobar :)
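
[ For readers following the thread: the query itself is a threshold test
  on the page refcount, roughly as sketched below. This is an
  illustration, not necessarily the exact helper from the patch. A false
  positive needs on the order of GUP_PIN_COUNTING_BIAS (1024 in this
  series) ordinary get_page() references piled onto one page, which is
  why the kerneldoc above says "likely" rather than "definitely":

	static inline bool page_dma_pinned(struct page *page)
	{
		/*
		 * Each FOLL_PIN adds GUP_PIN_COUNTING_BIAS to the page
		 * refcount, so a count at or above the bias very likely
		 * means at least one outstanding pin.
		 */
		return page_ref_count(compound_head(page)) >=
			GUP_PIN_COUNTING_BIAS;
	}

  The wrap-around handling Jerome asks about is covered by the FOLL_PIN
  documentation added earlier in the series. ]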

> > [...]
> > 
> >> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>  
> >>  		pgmap = get_dev_pagemap(pfn, pgmap);
> >>  		if (unlikely(!pgmap)) {
> >> -			undo_dev_pagemap(nr, nr_start, pages);
> >> +			undo_dev_pagemap(nr, nr_start, flags, pages);
> >>  			return 0;
> >>  		}
> >>  		SetPageReferenced(page);
> >>  		pages[*nr] = page;
> >> -		get_page(page);
> >> +
> >> +		if (flags & FOLL_PIN) {
> >> +			if (unlikely(!user_page_ref_inc(page))) {
> >> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +				return 0;
> >> +			}
> > 
> > Maybe add a comment about a case that should never happen, ie
> > user_page_ref_inc() failing on the second or a later iteration of
> > the loop, since it would be broken, and a bug, to call
> > undo_dev_pagemap() after the first iteration of that loop.
> > 
> > Also I believe that this should never happen: if the first
> > iteration succeeds, then __page_cache_add_speculative() will
> > succeed for all the iterations.
> > 
> > Note that the pgmap case above follows that too, ie the call to
> > get_dev_pagemap() can only fail on the first iteration of the loop.
> > Well, I assume you can never have a huge device page that spans
> > different pgmaps, ie different devices (which is a reasonable
> > assumption). So maybe this code needs fixing, ie:
> > 
> > 		pgmap = get_dev_pagemap(pfn, pgmap);
> > 		if (unlikely(!pgmap))
> > 			return 0;
> > 
> > 
> 
> OK, yes that does make sense. And I think a comment is adequate,
> no need to check for bugs during every tail page iteration. So how 
> about this, as a preliminary patch:

Actually I thought about it and I think that there is a pgmap
per section, and thus maybe one device can have multiple pgmaps,
which would be an issue for pages bigger than the section size
(ie bigger than 128MB iirc). I will go double check that, but
maybe Dan can chime in.

In any case my comment above is correct for the page ref
increment: if the first one succeeds then the others will too,
or otherwise it means someone is doing too many put_page()/
put_user_page() calls, which is _bad_ :)
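
[ Rough numbers behind the section-size concern, assuming the x86_64
  default of SECTION_SIZE_BITS = 27, ie 128MB sections; these are
  illustrative, not a statement about every configuration:

	2MB PMD device page:  64 fit in one 128MB section, and alignment
	                      means one can never cross a section boundary.
	1GB PUD device page:  1GB / 128MB = 8 sections spanned.

  So the "one pgmap per huge page" assumption is only in doubt for
  PUD-sized device pages, which is presumably the case for Dan to
  confirm. ]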

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..a4a81e125832 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                 unsigned long end, struct page **pages, int *nr)
>  {
> -       int nr_start = *nr;
> -       struct dev_pagemap *pgmap = NULL;
> +       /*
> +        * Huge pages should never cross dev_pagemap boundaries. Therefore, use
> +        * this same pgmap for the entire huge page.
> +        */
> +       struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
> +
> +       if (unlikely(!pgmap))
> +               return 0;
>  
>         do {
>                 struct page *page = pfn_to_page(pfn);
>  
> -               pgmap = get_dev_pagemap(pfn, pgmap);
> -               if (unlikely(!pgmap)) {
> -                       undo_dev_pagemap(nr, nr_start, pages);
> -                       return 0;
> -               }
>                 SetPageReferenced(page);
>                 pages[*nr] = page;
>                 get_page(page);
> 
> 
> 
> 
> >> +		} else
> >> +			get_page(page);
> >> +
> >>  		(*nr)++;
> >>  		pfn++;
> >>  	} while (addr += PAGE_SIZE, addr != end);
> > 
> > [...]
> > 
> >> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> >>  	unsigned long addr, len, end;
> >>  	int nr = 0, ret = 0;
> >>  
> >> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> >> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
> > 
> > Maybe add a comments to explain, something like:
> > 
> > /*
> >  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
> >  *
> >  * Note that get_user_pages_fast() imply FOLL_GET flag by default but
> >  * callers can over-ride this default to pin case by setting FOLL_PIN.
> >  */
> 
> Good idea. Here's the draft now:
> 
> /*
>  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
>  *
>  * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
>  * callers can override this default by setting FOLL_PIN instead of
>  * FOLL_GET.
>  */
> if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
>         return -EINVAL;

Looks good to me.

...

Cheers,
Jérôme

