All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Dave Chinner" <david@fromorbit.com>,
	"David Airlie" <airlied@linux.ie>,
	"David S . Miller" <davem@davemloft.net>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>, "Jens Axboe" <axboe@kernel.dk>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Date: Mon, 18 Nov 2019 12:58:29 +0100	[thread overview]
Message-ID: <20191118115829.GJ17319@quack2.suse.cz> (raw)
In-Reply-To: <20191115055340.1825745-18-jhubbard@nvidia.com>

On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
						^^ missing this reference
in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
							^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
			^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
				^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>  	refs = __record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> -		return 0;
> +	if (flags & FOLL_PIN) {
> +		head = page;
> +		if (unlikely(!user_page_ref_inc(head)))
> +			return 0;
> +		head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times so perhaps you can factor it out in to a helper function or
even move it to __record_subpages()?

> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return 0;
> +	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		put_compound_head(head, refs);

So this will do the wrong thing for FOLL_PIN. We took just one "pin"
reference there but here we'll release 'refs' normal references AFAICT.
Also the fact that you take just one pin reference for each huge page
substantially changes how GUP refcounting works in the huge page case.
Currently, FOLL_GET users can be completely agnostic of huge pages. So you
can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
drop page references from each bio completion function. With your new
FOLL_PIN behavior you cannot do that and I believe it will be a problem for
some users. So I think you have to maintain the behavior that you increase
the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Michal Hocko" <mhocko@suse.com>, "Jan Kara" <jack@suse.cz>,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	"David Airlie" <airlied@linux.ie>,
	"Dave Chinner" <david@fromorbit.com>,
	dri-devel@lists.freedesktop.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, "Paul Mackerras" <paulus@samba.org>,
	linux-kselftest@vger.kernel.org,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-rdma@vger.kernel.org,
	"Christoph Hellwig" <hch@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	linux-media@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	linux-block@vger.kernel.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	bpf@vger.kernel.org,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	netdev@vger.kernel.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org,
	"David S . Miller" <davem@davemloft.net>,
	"Mike Kravetz" <mike.kravetz@oracle.com>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Date: Mon, 18 Nov 2019 12:58:29 +0100	[thread overview]
Message-ID: <20191118115829.GJ17319@quack2.suse.cz> (raw)
In-Reply-To: <20191115055340.1825745-18-jhubbard@nvidia.com>

On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
						^^ missing this reference
in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
							^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
			^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
				^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>  	refs = __record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> -		return 0;
> +	if (flags & FOLL_PIN) {
> +		head = page;
> +		if (unlikely(!user_page_ref_inc(head)))
> +			return 0;
> +		head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times so perhaps you can factor it out in to a helper function or
even move it to __record_subpages()?

> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return 0;
> +	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		put_compound_head(head, refs);

So this will do the wrong thing for FOLL_PIN. We took just one "pin"
reference there but here we'll release 'refs' normal references AFAICT.
Also the fact that you take just one pin reference for each huge page
substantially changes how GUP refcounting works in the huge page case.
Currently, FOLL_GET users can be completely agnostic of huge pages. So you
can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
drop page references from each bio completion function. With your new
FOLL_PIN behavior you cannot do that and I believe it will be a problem for
some users. So I think you have to maintain the behavior that you increase
the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Michal Hocko" <mhocko@suse.com>, "Jan Kara" <jack@suse.cz>,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	"David Airlie" <airlied@linux.ie>,
	"Dave Chinner" <david@fromorbit.com>,
	dri-devel@lists.freedesktop.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, "Paul Mackerras" <paulus@samba.org>,
	linux-kselftest@vger.kernel.org,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-rdma@vger.kernel.org,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	linux-media@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	linux-block@vger.kernel.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@ke>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Date: Mon, 18 Nov 2019 12:58:29 +0100	[thread overview]
Message-ID: <20191118115829.GJ17319@quack2.suse.cz> (raw)
In-Reply-To: <20191115055340.1825745-18-jhubbard@nvidia.com>

On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
						^^ missing this reference
in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
							^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
			^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
				^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>  	refs = __record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> -		return 0;
> +	if (flags & FOLL_PIN) {
> +		head = page;
> +		if (unlikely(!user_page_ref_inc(head)))
> +			return 0;
> +		head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times so perhaps you can factor it out in to a helper function or
even move it to __record_subpages()?

> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return 0;
> +	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		put_compound_head(head, refs);

So this will do the wrong thing for FOLL_PIN. We took just one "pin"
reference there but here we'll release 'refs' normal references AFAICT.
Also the fact that you take just one pin reference for each huge page
substantially changes how GUP refcounting works in the huge page case.
Currently, FOLL_GET users can be completely agnostic of huge pages. So you
can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
drop page references from each bio completion function. With your new
FOLL_PIN behavior you cannot do that and I believe it will be a problem for
some users. So I think you have to maintain the behavior that you increase
the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Michal Hocko" <mhocko@suse.com>, "Jan Kara" <jack@suse.cz>,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	"David Airlie" <airlied@linux.ie>,
	"Dave Chinner" <david@fromorbit.com>,
	dri-devel@lists.freedesktop.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, "Paul Mackerras" <paulus@samba.org>,
	linux-kselftest@vger.kernel.org,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-rdma@vger.kernel.org,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	linux-media@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	linux-block@vger.kernel.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	bpf@vger.kernel.org,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	netdev@vger.kernel.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org,
	"David S . Miller" <davem@davemloft.net>,
	"Mike Kravetz" <mike.kravetz@oracle.com>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Date: Mon, 18 Nov 2019 12:58:29 +0100	[thread overview]
Message-ID: <20191118115829.GJ17319@quack2.suse.cz> (raw)
Message-ID: <20191118115829.hlZXBcdKjVwmttKgtNqaheaoG-kXLt5UcJO7o34YOlA@z> (raw)
In-Reply-To: <20191115055340.1825745-18-jhubbard@nvidia.com>

On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
						^^ missing this reference
in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
							^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
			^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
				^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>  	refs = __record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> -		return 0;
> +	if (flags & FOLL_PIN) {
> +		head = page;
> +		if (unlikely(!user_page_ref_inc(head)))
> +			return 0;
> +		head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times so perhaps you can factor it out in to a helper function or
even move it to __record_subpages()?

> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return 0;
> +	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  		put_compound_head(head, refs);

So this will do the wrong thing for FOLL_PIN. We took just one "pin"
reference there but here we'll release 'refs' normal references AFAICT.
Also the fact that you take just one pin reference for each huge page
substantially changes how GUP refcounting works in the huge page case.
Currently, FOLL_GET users can be completely agnostic of huge pages. So you
can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
drop page references from each bio completion function. With your new
FOLL_PIN behavior you cannot do that and I believe it will be a problem for
some users. So I think you have to maintain the behavior that you increase
the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

  reply	other threads:[~2019-11-18 11:58 UTC|newest]

Thread overview: 154+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-15  5:53 [PATCH v5 00/24] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
2019-11-15  5:53 ` John Hubbard
2019-11-15  5:53 ` John Hubbard
2019-11-15  5:53 ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 01/24] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 02/24] mm/gup: factor out duplicate code from four routines John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18  9:46   ` Jan Kara
2019-11-18  9:46     ` Jan Kara
2019-11-18  9:46     ` Jan Kara
2019-11-18  9:46     ` Jan Kara
2019-11-19  7:00     ` John Hubbard
2019-11-19  7:00       ` John Hubbard
2019-11-19  7:00       ` John Hubbard
2019-11-19  7:00       ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 03/24] mm/gup: move try_get_compound_head() to top, fix minor issues John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 04/24] mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 05/24] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 06/24] goldish_pipe: rename local pin_user_pages() routine John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18  9:47   ` Jan Kara
2019-11-18  9:47     ` Jan Kara
2019-11-18  9:47     ` Jan Kara
2019-11-18  9:47     ` Jan Kara
2019-11-15  5:53 ` [PATCH v5 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18  9:49   ` Jan Kara
2019-11-18  9:49     ` Jan Kara
2019-11-18  9:49     ` Jan Kara
2019-11-18  9:49     ` Jan Kara
2019-11-15  5:53 ` [PATCH v5 08/24] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 09/24] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15 14:08   ` Jason Gunthorpe
2019-11-15 14:08     ` Jason Gunthorpe
2019-11-15 14:08     ` Jason Gunthorpe
2019-11-15 18:06   ` Ira Weiny
2019-11-15 18:06     ` Ira Weiny
2019-11-15 18:06     ` Ira Weiny
2019-11-15 18:06     ` Ira Weiny
2019-11-15  5:53 ` [PATCH v5 10/24] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18 10:16   ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-19  5:17     ` John Hubbard
2019-11-19  5:17       ` John Hubbard
2019-11-19  5:17       ` John Hubbard
2019-11-19  5:17       ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 11/24] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18 10:16   ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-18 10:16     ` Jan Kara
2019-11-15  5:53 ` [PATCH v5 12/24] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP John Hubbard
2019-11-15  5:53   ` [PATCH v5 12/24] IB/{core, hw, umem}: " John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15 14:09   ` [PATCH v5 12/24] IB/{core,hw,umem}: " Jason Gunthorpe
2019-11-15 14:09     ` Jason Gunthorpe
2019-11-15 14:09     ` Jason Gunthorpe
2019-11-15  5:53 ` [PATCH v5 13/24] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18 10:30   ` Jan Kara
2019-11-18 10:30     ` Jan Kara
2019-11-18 10:30     ` Jan Kara
2019-11-18 10:30     ` Jan Kara
2019-11-15  5:53 ` [PATCH v5 14/24] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 15/24] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18 10:34   ` Jan Kara
2019-11-18 10:34     ` Jan Kara
2019-11-18 10:34     ` Jan Kara
2019-11-18 10:34     ` Jan Kara
2019-11-15  5:53 ` [PATCH v5 16/24] net/xdp: " John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 17/24] mm/gup: track FOLL_PIN pages John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-18 11:58   ` Jan Kara [this message]
2019-11-18 11:58     ` Jan Kara
2019-11-18 11:58     ` Jan Kara
2019-11-18 11:58     ` Jan Kara
2019-11-19  0:22     ` John Hubbard
2019-11-19  0:22       ` John Hubbard
2019-11-19  0:22       ` John Hubbard
2019-11-19  0:22       ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 18/24] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 19/24] vfio, mm: " John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 20/24] powerpc: book3s64: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 21/24] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 22/24] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 23/24] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53 ` [PATCH v5 24/24] mm, tree-wide: rename put_user_page*() to unpin_user_page*() John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard
2019-11-15  5:53   ` John Hubbard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191118115829.GJ17319@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=airlied@linux.ie \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=benh@kernel.crashing.org \
    --cc=bjorn.topel@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@ffwll.ch \
    --cc=davem@davemloft.net \
    --cc=david@fromorbit.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@infradead.org \
    --cc=ira.weiny@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=magnus.karlsson@intel.com \
    --cc=mchehab@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mike.kravetz@oracle.com \
    --cc=mpe@ellerman.id.au \
    --cc=netdev@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=shuah@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.