LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
From: John Hubbard <jhubbard@nvidia.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: "Michal Hocko" <mhocko@suse.com>, "Jan Kara" <jack@suse.cz>,
	kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	"David Airlie" <airlied@linux.ie>,
	"Dave Chinner" <david@fromorbit.com>,
	dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
	"Paul Mackerras" <paulus@samba.org>,
	linux-kselftest@vger.kernel.org,
	"Ira Weiny" <ira.weiny@intel.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	linux-rdma@vger.kernel.org,
	"Christoph Hellwig" <hch@infradead.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	linux-media@vger.kernel.org, "Shuah Khan" <shuah@kernel.org>,
	linux-block@vger.kernel.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	bpf@vger.kernel.org,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	netdev@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org,
	"David S . Miller" <davem@davemloft.net>,
	"Mike Kravetz" <mike.kravetz@oracle.com>
Subject: Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
Date: Mon, 4 Nov 2019 14:49:18 -0800
Message-ID: <7821cf87-75a8-45e2-cf28-f85b62192416@nvidia.com> (raw)
In-Reply-To: <20191104185238.GG5134@redhat.com>

On 11/4/19 10:52 AM, Jerome Glisse wrote:
> On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>>    bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
>>
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> This also has a couple of trivial, non-functional change fixes to
>> try_get_compound_head(). That function got moved to the top of the
>> file.
> 
> Maybe split that as a separate trivial patch.


Will do.


> 
>>
>> This includes the following fix from Ira Weiny:
>>
>> DAX requires detection of a page crossing to a ref count of 1.  Fix this
>> for GUP pages by introducing put_devmap_managed_user_page() which
>> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> 
> Please do the put_devmap_managed_page() changes in a separate
> patch, it would be a lot easier to follow, also on that front
> see comments below.


Oh! OK. It makes sense when you say it out loud. :)


...
>> +static inline bool put_devmap_managed_page(struct page *page)
>> +{
>> +	bool is_devmap = page_is_devmap_managed(page);
>> +
>> +	if (is_devmap) {
>> +		int count = page_ref_dec_return(page);
>> +
>> +		__put_devmap_managed_page(page, count);
>> +	}
>> +
>> +	return is_devmap;
>> +}
> 
> I think the __put_devmap_managed_page() should be rename
> to free_devmap_managed_page() and that the count != 1
> case move to this inline function ie:
> 
> static inline bool put_devmap_managed_page(struct page *page)
> {
> 	bool is_devmap = page_is_devmap_managed(page);
> 
> 	if (is_devmap) {
> 		int count = page_ref_dec_return(page);
> 
> 		/*
> 		 * If refcount is 1 then page is freed and refcount is stable as nobody
> 		 * holds a reference on the page.
> 		 */
> 		if (count == 1)
> 			free_devmap_managed_page(page, count);
> 		else if (!count)
> 			__put_page(page);
> 	}
> 
> 	return is_devmap;
> }
> 

Thanks, that does look cleaner and easier to read.

> 
>> +
>>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>>  static inline bool put_devmap_managed_page(struct page *page)
>>  {
>> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
>>  	return true;
>>  }
>>  
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
> 
> What about having it as an inline here as it is pretty small.


You mean move it to a static inline function in mm.h? It's worse than it 
looks, though: *everything* that it calls is also a static function, local
to gup.c. So I'd have to expose both try_get_compound_head() and
__update_proc_vmstat(). And that also means calling mod_node_page_state() from
mm.h, and it goes south right about there. :)


...  
>> +/**
>> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
>> + * or pin_longterm_pages*()
>> + * @page:	pointer to page to be queried.
>> + * @Return:	True, if it is likely that the page has been "dma-pinned".
>> + *		False, if the page is definitely not dma-pinned.
>> + */
> 
> Maybe add a small comment about wrap around :)


I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().

But it's true that the documentation is a little light here...What did you have 
in mind?


> [...]
> 
>> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>  
>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>  		if (unlikely(!pgmap)) {
>> -			undo_dev_pagemap(nr, nr_start, pages);
>> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>>  			return 0;
>>  		}
>>  		SetPageReferenced(page);
>>  		pages[*nr] = page;
>> -		get_page(page);
>> +
>> +		if (flags & FOLL_PIN) {
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>> +				return 0;
>> +			}
> 
> Maybe add a comment about a case that should never happens ie
> user_page_ref_inc() fails after the second iteration of the
> loop as it would be broken and a bug to call undo_dev_pagemap()
> after the first iteration of that loop.
> 
> Also i believe that this should never happens as if first
> iteration succeed than __page_cache_add_speculative() will
> succeed for all the iterations.
> 
> Note that the pgmap case above follows that too ie the call to
> get_dev_pagemap() can only fail on first iteration of the loop,
> well i assume you can never have a huge device page that span
> different pgmap ie different devices (which is a reasonable
> assumption). So maybe this code needs fixing ie :
> 
> 		pgmap = get_dev_pagemap(pfn, pgmap);
> 		if (unlikely(!pgmap))
> 			return 0;
> 
> 

OK, yes that does make sense. And I think a comment is adequate,
no need to check for bugs during every tail page iteration. So how 
about this, as a preliminary patch:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..a4a81e125832 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                unsigned long end, struct page **pages, int *nr)
 {
-       int nr_start = *nr;
-       struct dev_pagemap *pgmap = NULL;
+       /*
+        * Huge pages should never cross dev_pagemap boundaries. Therefore, use
+        * this same pgmap for the entire huge page.
+        */
+       struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
+
+       if (unlikely(!pgmap))
+               return 0;
 
        do {
                struct page *page = pfn_to_page(pfn);
 
-               pgmap = get_dev_pagemap(pfn, pgmap);
-               if (unlikely(!pgmap)) {
-                       undo_dev_pagemap(nr, nr_start, pages);
-                       return 0;
-               }
                SetPageReferenced(page);
                pages[*nr] = page;
                get_page(page);




>> +		} else
>> +			get_page(page);
>> +
>>  		(*nr)++;
>>  		pfn++;
>>  	} while (addr += PAGE_SIZE, addr != end);
> 
> [...]
> 
>> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>>  	unsigned long addr, len, end;
>>  	int nr = 0, ret = 0;
>>  
>> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
>> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
> 
> Maybe add a comments to explain, something like:
> 
> /*
>  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
>  *
>  * Note that get_user_pages_fast() imply FOLL_GET flag by default but
>  * callers can over-ride this default to pin case by setting FOLL_PIN.
>  */

Good idea. Here's the draft now:

/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
 *
 * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
 * callers can override this default by setting FOLL_PIN instead of
 * FOLL_GET.
 */
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
        return -EINVAL;

> 
>>  		return -EINVAL;
>>  
>>  	start = untagged_addr(start) & PAGE_MASK;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 13cc93785006..66bf4c8b88f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
> 
> [...]
> 
>> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
> 
> While i agree that user_page_ref_inc() (ie page_cache_add_speculative())
> should never fails here as we are holding the pmd lock and thus no one
> can unmap the pmd and free the page it points to. I believe you should
> return -EFAULT like for the pgmap and not -ENOMEM as the pgmap should
> not fail either for the same reason. Thus it would be better to have
> consistent error. Maybe also add a comments explaining that it should
> not fail here.
> 

OK. I'll take a pass through and fix up the remaining points about these
sorts of cases below, as well, in v3. Those all make sense.

>>  
>>  	return page;
>>  }
> 
> [...]
> 
>> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	 * device mapped pages can only be returned if the
>>  	 * caller will manage the page reference count.
>>  	 */
>> -	if (!(flags & FOLL_GET))
>> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
>>  		return ERR_PTR(-EEXIST);
> 
> Maybe add a comment that FOLL_GET or FOLL_PIN must be set.
> 
>>  	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
>> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
> 
> Same as for follow_devmap_pmd() see above.
> 
>>  
>>  	return page;
>>  }
>> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>>  skip_mlock:
>>  	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>>  	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
>> +
>>  	if (flags & FOLL_GET)
>>  		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = NULL;
> 
> This should not fail either as we are holding the pmd lock maybe add
> a comment. Dunno if we want a WARN() or something to catch this
> degenerate case, or dump the page.
> 
>>  
>>  out:
>>  	return page;
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index b45a95363a84..da335b1cd798 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>  same_page:
>>  		if (pages) {
>>  			pages[i] = mem_map_offset(page, pfn_offset);
>> -			get_page(pages[i]);
>> +
>> +			if (flags & FOLL_GET)
>> +				get_page(pages[i]);
>> +			else if (flags & FOLL_PIN)
>> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
>> +					spin_unlock(ptl);
>> +					remainder = 0;
>> +					err = -ENOMEM;
>> +					WARN_ON_ONCE(1);
>> +					break;
>> +				}
>>  		}
> 
> user_page_ref_inc() should not fail here either because we hold the
> ptl, so the WAR_ON_ONCE() is right but maybe add a comment.
> 
>>  
>>  		if (vmas)
> 
> [...]
> 
>> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>  	pte = huge_ptep_get((pte_t *)pmd);
>>  	if (pte_present(pte)) {
>>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>> +
>>  		if (flags & FOLL_GET)
>>  			get_page(page);
>> +		else if (flags & FOLL_PIN)
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				page = NULL;
>> +				goto out;
>> +			}
> 
> This should not fail either (again holding pmd lock), dunno if we want
> a warn or something to catch this degenerate case.
> 
>>  	} else {
>>  		if (is_hugetlb_entry_migration(pte)) {
>>  			spin_unlock(ptl);
> 
> [...]
> 
> 

Those are all good points, working on them now.

thanks,
-- 
John Hubbard
NVIDIA

  reply index

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
2019-11-04 16:39   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines John Hubbard
2019-11-04 16:51   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine John Hubbard
2019-11-04 16:52   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
2019-11-10 10:10   ` Hans Verkuil
2019-11-11 21:46     ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
2019-11-04 17:33   ` Jerome Glisse
2019-11-04 19:04     ` John Hubbard
2019-11-04 19:18       ` Jerome Glisse
2019-11-04 19:30         ` John Hubbard
2019-11-04 19:52           ` Jerome Glisse
2019-11-04 20:09             ` John Hubbard
2019-11-04 20:31               ` Jason Gunthorpe
2019-11-04 20:40                 ` John Hubbard
2019-11-04 20:31               ` Jerome Glisse
2019-11-04 20:37                 ` Jason Gunthorpe
2019-11-04 20:57                   ` John Hubbard
2019-11-04 21:15                     ` Jason Gunthorpe
2019-11-04 21:34                       ` John Hubbard
2019-11-04 20:33   ` David Rientjes
2019-11-04 20:48     ` Jerome Glisse
2019-11-05 13:10   ` Mike Rapoport
2019-11-05 19:00     ` John Hubbard
2019-11-07  2:25       ` Ira Weiny
2019-11-07  8:07       ` Mike Rapoport
2019-11-03 21:18 ` [PATCH v2 06/18] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-03 21:18 ` [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() John Hubbard
2019-11-04 20:33   ` Jason Gunthorpe
2019-11-04 20:48     ` John Hubbard
2019-11-04 20:57       ` Jason Gunthorpe
2019-11-04 22:03         ` John Hubbard
2019-11-05  2:32           ` Jason Gunthorpe
2019-11-07  2:26         ` Ira Weiny
2019-11-03 21:18 ` [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
2019-11-04 17:41   ` Jerome Glisse
2019-11-03 21:18 ` [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
2019-11-04 17:44   ` Jerome Glisse
2019-11-04 18:22     ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 10/18] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
2019-11-03 21:18 ` [PATCH v2 11/18] net/xdp: " John Hubbard
2019-11-03 21:18 ` [PATCH v2 12/18] mm/gup: track FOLL_PIN pages John Hubbard
2019-11-04 18:52   ` Jerome Glisse
2019-11-04 22:49     ` John Hubbard [this message]
2019-11-04 23:49       ` Jerome Glisse
2019-11-05  0:18         ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
2019-11-10 10:11   ` Hans Verkuil
2019-11-03 21:18 ` [PATCH v2 14/18] vfio, mm: " John Hubbard
2019-11-03 21:18 ` [PATCH v2 15/18] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page() John Hubbard
2019-11-03 21:18 ` [PATCH v2 16/18] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2019-11-03 21:18 ` [PATCH v2 17/18] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2019-11-03 21:18 ` [PATCH v2 18/18] mm/gup: remove support for gup(FOLL_LONGTERM) John Hubbard

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7821cf87-75a8-45e2-cf28-f85b62192416@nvidia.com \
    --to=jhubbard@nvidia.com \
    --cc=airlied@linux.ie \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bjorn.topel@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@ffwll.ch \
    --cc=davem@davemloft.net \
    --cc=david@fromorbit.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@infradead.org \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=magnus.karlsson@intel.com \
    --cc=mchehab@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mike.kravetz@oracle.com \
    --cc=netdev@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=shuah@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git