From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Dave Chinner" <david@fromorbit.com>,
	"David Airlie" <airlied@linux.ie>,
	"David S . Miller" <davem@davemloft.net>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>, "Jens Axboe" <axboe@kernel.dk>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v10 23/25] mm/gup: track FOLL_PIN pages
Date: Thu, 12 Dec 2019 11:17:41 +0100
Message-ID: <20191212101741.GD10065@quack2.suse.cz>
In-Reply-To: <20191212081917.1264184-24-jhubbard@nvidia.com>

On Thu 12-12-19 00:19:15, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
>     https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
>     https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
>     https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Thanks for the patch. As a side note, given this series is rather big, it
may be better to send just individual updated patches (as replies to the
review comments) instead of resending the whole series every time. You can
then resend the whole series once enough changes accumulate or we reach a
seemingly final state. That way people don't have to crawl through lots of
uninteresting emails... Just something to keep in mind for the future.

I've spotted just one issue in this patch (see below); the rest are just
small style nits.

> +#define page_ref_zero_or_close_to_bias_overflow(page) \
> +	((unsigned int) page_ref_count(page) + \
> +		GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)
> +

...

> +/**
> + * page_dma_pinned() - report if a page is pinned for DMA.
> + *
> + * This function checks if a page has been pinned via a call to
> + * pin_user_pages*().
> + *
> + * The return value is partially fuzzy: false is not fuzzy, because it means
> + * "definitely not pinned for DMA", but true means "probably pinned for DMA, but
> + * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
> + * of normal page references".
> + *
> + * False positives are OK, because: a) it's unlikely for a page to get that many
> + * refcounts, and b) all the callers of this routine are expected to be able to
> + * deal gracefully with a false positive.
> + *
> + * For more information, please see Documentation/vm/pin_user_pages.rst.
> + *
> + * @page:	pointer to page to be queried.
> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> + *		False, if the page is definitely not dma-pinned.
> + */
> +static inline bool page_dma_pinned(struct page *page)
> +{
> +	return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +

I realized one thing WRT the handling of page refcount overflow: the page
refcount is signed, and e.g. try_get_page() fails once the refcount is
negative. That means that:

a) page_ref_zero_or_close_to_bias_overflow() is not necessary - in all the
places that use pinning (i.e., advance the refcount by
GUP_PIN_COUNTING_BIAS), we should just rely on the check for a negative
value, for consistency.

b) page_dma_pinned() has to be careful and cast page_ref_count() to an
unsigned type for the comparison, as otherwise an overflowed refcount would
suddenly appear as not-pinned.
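
As an untested sketch of (b), keeping your v10 naming:

	static inline bool page_dma_pinned(struct page *page)
	{
		/*
		 * The cast matters: an overflowed (negative) refcount must
		 * still compare as >= GUP_PIN_COUNTING_BIAS, i.e. "pinned",
		 * not silently turn into "not pinned".
		 */
		return (unsigned int)page_ref_count(compound_head(page)) >=
			GUP_PIN_COUNTING_BIAS;
	}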

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:	pointer to page to be marked
> + * @Return:	the compound head page, with ref appropriately incremented,
> + * or NULL upon failure.
> + */
> +__must_check struct page *try_pin_compound_head(struct page *page, int refs)
> +{
> +	struct page *head = try_get_compound_head(page,
> +						  GUP_PIN_COUNTING_BIAS * refs);
> +	if (!head)
> +		return NULL;
> +
> +	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> +	return head;
> +}
> +
> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
> + * same time. (That's true throughout the get_user_pages*() and
> + * pin_user_pages*() APIs.) Cases:
> + *
> + *	FOLL_GET: page's refcount will be incremented by 1.
> + *      FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.

Some tab vs space issue here... Generally we don't use tabs inside comments
for indenting, so I'd vote for using just spaces.

> + *
> + * Return: head page (with refcount appropriately incremented) for success, or
> + * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
> + * considered failure, and furthermore, a likely bug in the caller, so a warning
> + * is also emitted.
> + */
> +static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> +							  int refs,
> +							  unsigned int flags)
> +{
> +	if (flags & FOLL_GET)
> +		return try_get_compound_head(page, refs);
> +	else if (flags & FOLL_PIN)
> +		return try_pin_compound_head(page, refs);
> +
> +	WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == 0);

This could be just WARN_ON_ONCE(1), right?
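
I.e., since both FOLL_GET and FOLL_PIN were already handled above, simply:

	/* Neither FOLL_GET nor FOLL_PIN was set: a likely caller bug. */
	WARN_ON_ONCE(1);
	return NULL;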

> +	return NULL;
> +}
> +
> +/**
> + * try_grab_page() - elevate a page's refcount by a flag-dependent amount
> + *
> + * This might not do anything at all, depending on the flags argument.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * @page:	pointer to page to be grabbed
> + * @flags:	gup flags: these are the FOLL_* flag values.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
> + * time. Cases:
> + *
> + *	FOLL_GET: page's refcount will be incremented by 1.
> + *      FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.

Again tab vs space difference here.

> + *
> + * Return: true for success, or if no action was required (if neither FOLL_PIN
> + * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
> + * FOLL_PIN was set, but the page could not be grabbed.
> + */
> +bool __must_check try_grab_page(struct page *page, unsigned int flags)
> +{
> +	if (flags & FOLL_GET)
> +		return try_get_page(page);
> +	else if (flags & FOLL_PIN) {
> +		page = compound_head(page);
> +		WARN_ON_ONCE(flags & FOLL_GET);
> +
> +		if (WARN_ON_ONCE(page_ref_zero_or_close_to_bias_overflow(page)))
> +			return false;

As I mentioned above, this will need a "negative refcount" check instead...
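
Something like the following, mirroring the check that try_get_page()
itself does (untested):

	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
		return false;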

> +
> +		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> +		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
> +	}
> +
> +	return true;
> +}

...

> @@ -1468,6 +1482,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct page *page = NULL;
> +	struct page *subpage = NULL;
>  
>  	assert_spin_locked(pmd_lockptr(mm, pmd));
>  
> @@ -1486,6 +1501,14 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>  	if (flags & FOLL_TOUCH)
>  		touch_pmd(vma, addr, pmd, flags);
> +
> +	subpage = page;
> +	subpage += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> +	VM_BUG_ON_PAGE(!PageCompound(subpage) &&
> +		       !is_zone_device_page(subpage), subpage);
> +	if (!try_grab_page(subpage, flags))
> +		return ERR_PTR(-EFAULT);
> +

Hum, I think you've made this change more complex than it has to be.
try_grab_page() is the same for the head page or a subpage, because we
increment the refcount on compound_head(page) anyway. So I'd leave this
function as is (i.e., not add subpage or move the VM_BUG_ON_PAGE()) and
just do the following at this spot:

	if (!try_grab_page(page, flags))
		return ERR_PTR(-EFAULT);

Also one comment regarding the error code: some places seem to return
-ENOMEM when they fail to grab a page reference. Shouldn't we rather return
that one for consistency?
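
I.e., putting the two suggestions together, this spot would just become:

	if (!try_grab_page(page, flags))
		return ERR_PTR(-ENOMEM);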

>  	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>  		/*
>  		 * We don't mlock() pte-mapped THPs. This way we can avoid
> @@ -1509,24 +1532,18 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  		 */
>  
>  		if (PageAnon(page) && compound_mapcount(page) != 1)
> -			goto skip_mlock;
> +			goto out;
>  		if (PageDoubleMap(page) || !page->mapping)
> -			goto skip_mlock;
> +			goto out;
>  		if (!trylock_page(page))
> -			goto skip_mlock;
> +			goto out;
>  		lru_add_drain();
>  		if (page->mapping && !PageDoubleMap(page))
>  			mlock_vma_page(page);
>  		unlock_page(page);
>  	}
> -skip_mlock:
> -	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> -	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> -	if (flags & FOLL_GET)
> -		get_page(page);
> -
>  out:
> -	return page;
> +	return subpage;
>  }
>  

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
