Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

From: Jerome Glisse <jglisse@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Dave Chinner" <david@fromorbit.com>,
	"David Airlie" <airlied@linux.ie>,
	"David S . Miller" <davem@davemloft.net>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>, "Jens Axboe" <axboe@kernel.dk>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
Date: Mon, 4 Nov 2019 13:52:38 -0500	[thread overview]
Message-ID: <20191104185238.GG5134@redhat.com> (raw)
In-Reply-To: <20191103211813.213227-13-jhubbard@nvidia.com>

On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> This also has a couple of trivial, non-functional change fixes to
> try_get_compound_head(). That function got moved to the top of the
> file.

Maybe split that as a separate trivial patch.

> 
> This includes the following fix from Ira Weiny:
> 
> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> for GUP pages by introducing put_devmap_managed_user_page() which
> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.

Please do the put_devmap_managed_page() changes in a separate
patch, it would be a lot easier to follow, also on that front
see comments below.

> 
> [1] https://lwn.net/Articles/784574/ "Some slow progress on
> get_user_pages()"
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h       |  80 +++++++++++----
>  include/linux/mmzone.h   |   2 +
>  include/linux/page_ref.h |  10 ++
>  mm/gup.c                 | 213 +++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c         |  32 +++++-
>  mm/hugetlb.c             |  28 ++++-
>  mm/memremap.c            |   4 +-
>  mm/vmstat.c              |   2 +
>  8 files changed, 300 insertions(+), 71 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdfb6fedb271..03b3600843b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void __put_devmap_managed_page(struct page *page, int count);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>  	if (!static_branch_unlikely(&devmap_managed_key))
>  		return false;
> @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	switch (page->pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
>  	case MEMORY_DEVICE_FS_DAX:
> -		__put_devmap_managed_page(page);
>  		return true;
>  	default:
>  		break;
> @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {
> +		int count = page_ref_dec_return(page);
> +
> +		__put_devmap_managed_page(page, count);
> +	}
> +
> +	return is_devmap;
> +}

I think the __put_devmap_managed_page() should be rename
to free_devmap_managed_page() and that the count != 1
case move to this inline function ie:

static inline bool put_devmap_managed_page(struct page *page)
{
	bool is_devmap = page_is_devmap_managed(page);

	if (is_devmap) {
		int count = page_ref_dec_return(page);

		/*
		 * If refcount is 1 then page is freed and refcount is stable as nobody
		 * holds a reference on the page.
		 */
		if (count == 1)
			free_devmap_managed_page(page, count);
		else if (!count)
			__put_page(page);
	}

	return is_devmap;
}

> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +

What about having it as an inline here as it is pretty small.

>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
> + * as distinct from normal pages. As such, the new put_user_page() call (and
> + * its variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with get_user_page() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
> -static inline void put_user_page(struct page *page)
> -{
> -	put_page(page);
> -}
> +#define GUP_PIN_COUNTING_BIAS (1UL << 10)
>  
> +void put_user_page(struct page *page);
>  void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  			       bool make_dirty);
> -
>  void put_user_pages(struct page **pages, unsigned long npages);
>  
> +/**
> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
> + * or pin_longterm_pages*()
> + * @page:	pointer to page to be queried.
> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> + *		False, if the page is definitely not dma-pinned.
> + */

Maybe add a small comment about wrap around :)

> +static inline bool page_dma_pinned(struct page *page)
> +{
> +	return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 1aea48427879..c9727e65fad3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  
>  		pgmap = get_dev_pagemap(pfn, pgmap);
>  		if (unlikely(!pgmap)) {
> -			undo_dev_pagemap(nr, nr_start, pages);
> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>  			return 0;
>  		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
> -		get_page(page);
> +
> +		if (flags & FOLL_PIN) {
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				return 0;
> +			}

Maybe add a comment about a case that should never happens ie
user_page_ref_inc() fails after the second iteration of the
loop as it would be broken and a bug to call undo_dev_pagemap()
after the first iteration of that loop.

Also i believe that this should never happens as if first
iteration succeed than __page_cache_add_speculative() will
succeed for all the iterations.

Note that the pgmap case above follows that too ie the call to
get_dev_pagemap() can only fail on first iteration of the loop,
well i assume you can never have a huge device page that span
different pgmap ie different devices (which is a reasonable
assumption). So maybe this code needs fixing ie :

		pgmap = get_dev_pagemap(pfn, pgmap);
		if (unlikely(!pgmap))
			return 0;

> +		} else
> +			get_page(page);
> +
>  		(*nr)++;
>  		pfn++;
>  	} while (addr += PAGE_SIZE, addr != end);

[...]

> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;
>  
> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))

Maybe add a comments to explain, something like:

/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
 *
 * Note that get_user_pages_fast() imply FOLL_GET flag by default but
 * callers can over-ride this default to pin case by setting FOLL_PIN.
 */

>  		return -EINVAL;
>  
>  	start = untagged_addr(start) & PAGE_MASK;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..66bf4c8b88f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[...]

> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (!*pgmap)
>  		return ERR_PTR(-EFAULT);
>  	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

While i agree that user_page_ref_inc() (ie page_cache_add_speculative())
should never fails here as we are holding the pmd lock and thus no one
can unmap the pmd and free the page it points to. I believe you should
return -EFAULT like for the pgmap and not -ENOMEM as the pgmap should
not fail either for the same reason. Thus it would be better to have
consistent error. Maybe also add a comments explaining that it should
not fail here.

>  
>  	return page;
>  }

[...]

> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  	 * device mapped pages can only be returned if the
>  	 * caller will manage the page reference count.
>  	 */
> -	if (!(flags & FOLL_GET))
> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
>  		return ERR_PTR(-EEXIST);

Maybe add a comment that FOLL_GET or FOLL_PIN must be set.

>  	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  	if (!*pgmap)
>  		return ERR_PTR(-EFAULT);
>  	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

Same as for follow_devmap_pmd() see above.

>  
>  	return page;
>  }
> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  skip_mlock:
>  	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>  	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> +
>  	if (flags & FOLL_GET)
>  		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = NULL;

This should not fail either as we are holding the pmd lock maybe add
a comment. Dunno if we want a WARN() or something to catch this
degenerate case, or dump the page.

>  
>  out:
>  	return page;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..da335b1cd798 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  same_page:
>  		if (pages) {
>  			pages[i] = mem_map_offset(page, pfn_offset);
> -			get_page(pages[i]);
> +
> +			if (flags & FOLL_GET)
> +				get_page(pages[i]);
> +			else if (flags & FOLL_PIN)
> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
> +					spin_unlock(ptl);
> +					remainder = 0;
> +					err = -ENOMEM;
> +					WARN_ON_ONCE(1);
> +					break;
> +				}
>  		}

user_page_ref_inc() should not fail here either because we hold the
ptl, so the WAR_ON_ONCE() is right but maybe add a comment.

>  
>  		if (vmas)

[...]

> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	pte = huge_ptep_get((pte_t *)pmd);
>  	if (pte_present(pte)) {
>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
>  		if (flags & FOLL_GET)
>  			get_page(page);
> +		else if (flags & FOLL_PIN)
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				page = NULL;
> +				goto out;
> +			}

This should not fail either (again holding pmd lock), dunno if we want
a warn or something to catch this degenerate case.

>  	} else {
>  		if (is_hugetlb_entry_migration(pte)) {
>  			spin_unlock(ptl);

[...]