From: John Hubbard <jhubbard@nvidia.com>
To: Jan Kara <jack@suse.cz>
Cc: "Michal Hocko" <mhocko@suse.com>,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	"David Airlie" <airlied@linux.ie>,
	"Dave Chinner" <david@fromorbit.com>,
	dri-devel@lists.freedesktop.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, "Paul Mackerras" <paulus@samba.org>,
	linux-kselftest@vger.kernel.org,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-rdma@vger.kernel.org,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	linux-media@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	linux-block@vger.kernel.org, "Jérôme Glisse" <jglisse@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	bpf@vger.kernel.org,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	netdev@vger.kernel.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org,
	"David S . Miller" <davem@davemloft.net>,
	"Mike Kravetz" <mike.kravetz@oracle.com>
Subject: Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages
Date: Mon, 18 Nov 2019 16:22:43 -0800
Message-ID: <8424f891-271d-5c34-8f7c-ebf3e3aa6664@nvidia.com>
In-Reply-To: <20191118115829.GJ17319@quack2.suse.cz>

On 11/18/19 3:58 AM, Jan Kara wrote:
> On Thu 14-11-19 21:53:33, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>>    bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
> 						^^ missing this reference
> in the changelog...

I'll add the missing [1] reference.

> 
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6588d2e02628..db872766480f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>>  	return true;
>>  }
>>  
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
>>  static inline void put_page(struct page *page)
>>  {
>>  	page = compound_head(page);
>> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>>  		__put_page(page);
>>  }
>>  
>> -/**
>> - * put_user_page() - release a gup-pinned page
>> - * @page:            pointer to page to be released
>> +/*
>> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
>> + * the page's refcount so that two separate items are tracked: the original page
>> + * reference count, and also a new count of how many get_user_pages() calls were
> 							^^ pin_user_pages()
> 
>> + * made against the page. ("gup-pinned" is another term for the latter).
>> + *
>> + * With this scheme, get_user_pages() becomes special: such pages are marked
> 			^^^ pin_user_pages()
> 
>> + * as distinct from normal pages. As such, the put_user_page() call (and its
>> + * variants) must be used in order to release gup-pinned pages.
>> + *
>> + * Choice of value:
>>   *
>> - * Pages that were pinned via pin_user_pages*() must be released via either
>> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
>> - * eventually such pages can be separately tracked and uniquely handled. In
>> - * particular, interactions with RDMA and filesystems need special handling.
>> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
>> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
> 				^^^ pin_user_pages()
> 

Yes.

>> + * due to the fact that adding an even power of two to the page refcount has
>> + * the effect of using only the upper N bits, for the code that counts up using
>> + * the bias value. This means that the lower bits are left for the exclusive
>> + * use of the original code that increments and decrements by one (or at least,
>> + * by much smaller values than the bias value).
>>   *
>> - * put_user_page() and put_page() are not interchangeable, despite this early
>> - * implementation that makes them look the same. put_user_page() calls must
>> - * be perfectly matched up with pin*() calls.
>> + * Of course, once the lower bits overflow into the upper bits (and this is
>> + * OK, because subtraction recovers the original values), then visual inspection
>> + * no longer suffices to directly view the separate counts. However, for normal
>> + * applications that don't have huge page reference counts, this won't be an
>> + * issue.
>> + *
>> + * Locking: the lockless algorithm described in page_cache_get_speculative()
>> + * and page_cache_gup_pin_speculative() provides safe operation for
>> + * get_user_pages and page_mkclean and other calls that race to set up page
>> + * table entries.
>>   */
> ...
>> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>>  	refs = __record_subpages(page, addr, end, pages + *nr);
>>  
>> -	head = try_get_compound_head(head, refs);
>> -	if (!head)
>> -		return 0;
>> +	if (flags & FOLL_PIN) {
>> +		head = page;
>> +		if (unlikely(!user_page_ref_inc(head)))
>> +			return 0;
>> +		head = page;
> 
> Why do you assign 'head' twice? Also the refcounting logic is repeated
> several times so perhaps you can factor it out in to a helper function or
> even move it to __record_subpages()?

OK.
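
One possible shape for such a helper, sketched as a standalone userspace
model rather than kernel code. The names model_try_grab() and model_put(),
the placeholder flag values, and the bias value of 1024 are all assumptions
made for illustration. The point is only that the grab and release paths
share one FOLL_PIN-aware calculation, so they cannot disagree about how much
to add or subtract.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace model; none of this is kernel code. */
#define GUP_PIN_COUNTING_BIAS 1024	/* assumed value, a power of two */
#define FOLL_GET 0x1			/* placeholder flag values */
#define FOLL_PIN 0x2

struct model_page {
	int refcount;
};

/* Shared by grab and release so both sides compute the same delta. */
static int model_ref_delta(int refs, unsigned int flags)
{
	assert(refs > 0);
	return (flags & FOLL_PIN) ? refs * GUP_PIN_COUNTING_BIAS : refs;
}

/* Hypothetical helper: take 'refs' references of the kind 'flags' asks for. */
static bool model_try_grab(struct model_page *head, int refs,
			   unsigned int flags)
{
	if (head->refcount <= 0)	/* page is already on its way out */
		return false;
	head->refcount += model_ref_delta(refs, flags);
	return true;
}

/* Hypothetical counterpart: undo exactly what model_try_grab() did. */
static void model_put(struct model_page *head, int refs, unsigned int flags)
{
	head->refcount -= model_ref_delta(refs, flags);
}
```

With this shape, each gup_*() call site would reduce to one grab call and,
on the pte-changed path, one matching put call that passes the same flags.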

> 
>> +	} else {
>> +		head = try_get_compound_head(head, refs);
>> +		if (!head)
>> +			return 0;
>> +	}
>>  
>>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>  		put_compound_head(head, refs);
> 
> So this will do the wrong thing for FOLL_PIN. We took just one "pin"
> reference there but here we'll release 'refs' normal references AFAICT.
> Also the fact that you take just one pin reference for each huge page
> substantially changes how GUP refcounting works in the huge page case.
> Currently, FOLL_GET users can be completely agnostic of huge pages. So you
> can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
> drop page references from each bio completion function. With your new
> FOLL_PIN behavior you cannot do that and I believe it will be a problem for
> some users. So I think you have to maintain the behavior that you increase
> the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.
> 

Yes, completely agreed, this was a (big) oversight. I went through the same
reasoning and reached your conclusions, in __gup_device_huge(), but then
did it wrong in these functions. Will fix.
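
To make the accounting concrete, here is a small userspace sketch of the
two-bio case described above. It is not kernel code: the bias value of 1024,
the 4 KB subpage size (so 512 subpages per 2 MB page), and the model_*()
names are assumptions for illustration only.

```c
#include <assert.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* assumed value, a power of two */
#define SUBPAGES_PER_HUGE_PAGE 512	/* 2 MB / 4 KB, assumed page size */

/*
 * Split a combined refcount into its two parts. This is only meaningful
 * while the ordinary reference count stays below the bias value.
 */
static int model_pin_count(int refcount)
{
	return refcount / GUP_PIN_COUNTING_BIAS;
}

static int model_plain_refs(int refcount)
{
	return refcount % GUP_PIN_COUNTING_BIAS;
}

/*
 * Pin a whole huge page with one pin per subpage, then release it from
 * two independent "bio completions", each covering half of the subpages.
 * Returns the final refcount of the head page.
 */
static int model_two_bio_scenario(void)
{
	int refcount = 1;	/* one ordinary reference held elsewhere */

	refcount += SUBPAGES_PER_HUGE_PAGE * GUP_PIN_COUNTING_BIAS;
	assert(model_pin_count(refcount) == SUBPAGES_PER_HUGE_PAGE);

	/* First completion drops its 256 pins; 256 remain. */
	refcount -= (SUBPAGES_PER_HUGE_PAGE / 2) * GUP_PIN_COUNTING_BIAS;
	assert(model_pin_count(refcount) == SUBPAGES_PER_HUGE_PAGE / 2);

	/* Second completion drops the rest. */
	refcount -= (SUBPAGES_PER_HUGE_PAGE / 2) * GUP_PIN_COUNTING_BIAS;
	return refcount;
}
```

If only a single bias had been added for the whole huge page, the first
completion's release would already have removed more than was taken, which
is the mismatch pointed out above.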

thanks,
-- 
John Hubbard
NVIDIA

