BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
	"Al Viro" <viro@zeniv.linux.org.uk>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Christoph Hellwig" <hch@infradead.org>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Dave Chinner" <david@fromorbit.com>,
	"David Airlie" <airlied@linux.ie>,
	"David S . Miller" <davem@davemloft.net>,
	"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
	"Jason Gunthorpe" <jgg@ziepe.ca>, "Jens Axboe" <axboe@kernel.dk>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Jérôme Glisse" <jglisse@redhat.com>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Mauro Carvalho Chehab" <mchehab@kernel.org>,
	"Michael Ellerman" <mpe@ellerman.id.au>,
	"Michal Hocko" <mhocko@suse.com>,
	"Mike Kravetz" <mike.kravetz@oracle.com>,
	"Paul Mackerras" <paulus@samba.org>,
	"Shuah Khan" <shuah@kernel.org>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	bpf@vger.kernel.org, dri-devel@lists.freedesktop.org,
	kvm@vger.kernel.org, linux-block@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
	linux-rdma@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v7 17/24] mm/gup: track FOLL_PIN pages
Date: Thu, 21 Nov 2019 10:39:41 +0100
Message-ID: <20191121093941.GA18190@quack2.suse.cz> (raw)
In-Reply-To: <20191121071354.456618-18-jhubbard@nvidia.com>

On Wed 20-11-19 23:13:47, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
>     https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
>     https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
>     https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Thanks for the patch! We are mostly getting there. Some smaller comments
below.

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:	pointer to page to be marked
> + * @Return:	true for success, false for failure
> + */
> +__must_check bool try_pin_compound_head(struct page *page, int refs)
> +{
> +	page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS * refs);
> +	if (!page)
> +		return false;
> +
> +	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> +	return true;
> +}
> +
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +static bool __put_devmap_managed_user_page(struct page *page)

Probably call this __unpin_devmap_managed_user_page()? To match the later
conversion of put_user_page() to unpin_user_page()?

> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {
> +		int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> +
> +		__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> +		/*
> +		 * devmap page refcounts are 1-based, rather than 0-based: if
> +		 * refcount is 1, then the page is free and the refcount is
> +		 * stable because nobody holds a reference on the page.
> +		 */
> +		if (count == 1)
> +			free_devmap_managed_page(page);
> +		else if (!count)
> +			__put_page(page);
> +	}
> +
> +	return is_devmap;
> +}
> +#else
> +static bool __put_devmap_managed_user_page(struct page *page)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
> +
> +/**
> + * put_user_page() - release a dma-pinned page
> + * @page:            pointer to page to be released
> + *
> + * Pages that were pinned via pin_user_pages*() must be released via either
> + * put_user_page(), or one of the put_user_pages*() routines. This is so that
> + * such pages can be separately tracked and uniquely handled. In particular,
> + * interactions with RDMA and filesystems need special handling.
> + */
> +void put_user_page(struct page *page)
> +{
> +	page = compound_head(page);
> +
> +	/*
> +	 * For devmap managed pages we need to catch refcount transition from
> +	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> +	 * page is free and we need to inform the device driver through
> +	 * callback. See include/linux/memremap.h and HMM for details.
> +	 */
> +	if (__put_devmap_managed_user_page(page))
> +		return;
> +
> +	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> +		__put_page(page);
> +
> +	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -237,10 +327,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  
>  	page = vm_normal_page(vma, address, pte);
> -	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> +	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>  		/*
> -		 * Only return device mapping pages in the FOLL_GET case since
> -		 * they are only valid while holding the pgmap reference.
> +		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
> +		 * case since they are only valid while holding the pgmap
> +		 * reference.
>  		 */
>  		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
>  		if (*pgmap)
> @@ -283,6 +374,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  			page = ERR_PTR(-ENOMEM);
>  			goto out;
>  		}
> +	} else if (flags & FOLL_PIN) {
> +		if (unlikely(!try_pin_page(page))) {
> +			page = ERR_PTR(-ENOMEM);
> +			goto out;
> +		}

Use grab_page() here?

>  	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
> @@ -1890,9 +2000,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  
> -		head = try_get_compound_head(page, 1);
> -		if (!head)
> -			goto pte_unmap;
> +		if (flags & FOLL_PIN) {
> +			head = page;
> +			if (unlikely(!try_pin_page(head)))
> +				goto pte_unmap;
> +		} else {
> +			head = try_get_compound_head(page, 1);
> +			if (!head)
> +				goto pte_unmap;
> +		}

Why don't you use grab_page() here? Also you seem to loose the head =
compound_head(page) indirection here for the FOLL_PIN case?

>  
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>  			put_page(head);
> @@ -1946,12 +2062,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  
>  		pgmap = get_dev_pagemap(pfn, pgmap);
>  		if (unlikely(!pgmap)) {
> -			undo_dev_pagemap(nr, nr_start, pages);
> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>  			return 0;
>  		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
> -		get_page(page);
> +
> +		if (flags & FOLL_PIN) {
> +			if (unlikely(!try_pin_page(page))) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				return 0;
> +			}
> +		} else
> +			get_page(page);
> +

Use grab_page() here?

>  		(*nr)++;
>  		pfn++;
>  	} while (addr += PAGE_SIZE, addr != end);
...
> @@ -2025,12 +2149,31 @@ static int __record_subpages(struct page *page, unsigned long addr,
>  	return nr;
>  }
>  
> -static void put_compound_head(struct page *page, int refs)
> +static bool grab_compound_head(struct page *head, int refs, unsigned int flags)
>  {
> +	if (flags & FOLL_PIN) {
> +		if (unlikely(!try_pin_compound_head(head, refs)))
> +			return false;
> +	} else {
> +		head = try_get_compound_head(head, refs);
> +		if (!head)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> +{
> +	struct page *head = compound_head(page);
> +
> +	if (flags & FOLL_PIN)
> +		refs *= GUP_PIN_COUNTING_BIAS;
> +
>  	/* Do a get_page() first, in case refs == page->_refcount */
> -	get_page(page);
> -	page_ref_sub(page, refs);
> -	put_page(page);
> +	get_page(head);
> +	page_ref_sub(head, refs);
> +	put_page(head);
>  }
>  
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
> @@ -2064,14 +2207,13 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  
>  	head = pte_page(pte);
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> -	refs = __record_subpages(page, addr, end, pages + *nr);
> +	refs = record_subpages(page, addr, end, pages + *nr);
>  
> -	head = try_get_compound_head(head, refs);
> -	if (!head)
> +	if (!grab_compound_head(head, refs, flags))

Are you sure this is correct? Historically we seem to have always had logic
like:

	head = compound_head(pte_page / pmd_page / ... (orig))

in this code. And you removed this now. Looking at the code I'm not sure
whether the compound_head() indirection is really needed or not. We seem to
have already huge page head in the page table but maybe there's some subtle
case I'm missing. So I'd be calmer if we left the head=compound_head(...)
in the code but if you really want to remove it, I'd like to see Ack from
someone actually familiar with huge pages - e.g. Kirill Shutemov...

And even if we find out that compound_head() indirection isn't really
needed, that is big enough change in the logic that it would deserve to be
done in a separate patch (if only for debugging by bisection purposes).

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..981a9ea0b83f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
...
> @@ -5034,8 +5052,20 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	pte = huge_ptep_get((pte_t *)pmd);
>  	if (pte_present(pte)) {
>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
>  		if (flags & FOLL_GET)
>  			get_page(page);
> +		else if (flags & FOLL_PIN) {
> +			/*
> +			 * try_pin_page() is not actually expected to fail
> +			 * here because we hold the ptl.
> +			 */
> +			if (unlikely(!try_pin_page(page))) {
> +				WARN_ON_ONCE(1);
> +				page = NULL;
> +				goto out;
> +			}
> +		}

Use grab_page() here?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply index

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-21  7:13 [PATCH v7 00/24] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
2019-11-21  7:13 ` [PATCH v7 01/24] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
2019-11-21  8:06   ` Christoph Hellwig
2019-11-21  8:25     ` John Hubbard
2019-11-21  7:13 ` [PATCH v7 02/24] mm/gup: factor out duplicate code from four routines John Hubbard
2019-11-21  8:03   ` Christoph Hellwig
2019-11-21  8:29     ` John Hubbard
2019-11-21  9:49       ` Jan Kara
2019-11-21 21:47         ` John Hubbard
2019-11-21  9:54       ` Jan Kara
2019-11-22  2:54         ` John Hubbard
2019-11-22 11:15           ` Jan Kara
2019-11-21  7:13 ` [PATCH v7 03/24] mm/gup: move try_get_compound_head() to top, fix minor issues John Hubbard
2019-11-21  8:07   ` Christoph Hellwig
2019-11-21  7:13 ` [PATCH v7 04/24] mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard
2019-11-21  8:04   ` Christoph Hellwig
2019-11-21  7:13 ` [PATCH v7 05/24] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
2019-11-21  8:05   ` Christoph Hellwig
2019-11-21  8:54     ` John Hubbard
2019-11-21 16:59       ` Dan Williams
2019-11-21 22:22         ` John Hubbard
2019-11-21  7:13 ` [PATCH v7 06/24] goldish_pipe: rename local pin_user_pages() routine John Hubbard
2019-11-21  8:08   ` Christoph Hellwig
2019-11-21  8:36     ` John Hubbard
2019-11-21  7:13 ` [PATCH v7 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages John Hubbard
2019-11-21  8:07   ` Christoph Hellwig
2019-11-21 14:36     ` Jason Gunthorpe
2019-11-24  6:14       ` John Hubbard
2019-11-24 10:07       ` Leon Romanovsky
2019-11-25  0:05         ` John Hubbard
2019-11-25  0:53           ` Jason Gunthorpe
2019-11-21  7:13 ` [PATCH v7 08/24] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
2019-11-21  8:09   ` Christoph Hellwig
2019-11-21  7:13 ` [PATCH v7 09/24] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM John Hubbard
2019-11-21  8:10   ` Christoph Hellwig
2019-11-21  8:48     ` John Hubbard
2019-11-21 21:35   ` Alex Williamson
2019-11-21 21:49     ` John Hubbard
2019-11-21  7:13 ` [PATCH v7 10/24] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
2019-11-21  7:13 ` [PATCH v7 11/24] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-21  7:13 ` [PATCH v7 12/24] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP John Hubbard
2019-11-21  7:13 ` [PATCH v7 13/24] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
2019-11-21  7:13 ` [PATCH v7 14/24] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
2019-11-21  7:13 ` [PATCH v7 15/24] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
2019-11-21  7:13 ` [PATCH v7 16/24] net/xdp: " John Hubbard
2019-11-21  7:13 ` [PATCH v7 17/24] mm/gup: track FOLL_PIN pages John Hubbard
2019-11-21  9:39   ` Jan Kara [this message]
2019-11-30 18:59   ` kbuild test robot
2019-11-21  7:13 ` [PATCH v7 18/24] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
2019-11-21  7:13 ` [PATCH v7 19/24] vfio, mm: " John Hubbard
2019-11-21  7:13 ` [PATCH v7 20/24] powerpc: book3s64: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-21  7:13 ` [PATCH v7 21/24] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" John Hubbard
2019-11-21  7:13 ` [PATCH v7 22/24] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2019-11-21  7:13 ` [PATCH v7 23/24] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2019-11-21  7:13 ` [PATCH v7 24/24] mm, tree-wide: rename put_user_page*() to unpin_user_page*() John Hubbard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191121093941.GA18190@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=airlied@linux.ie \
    --cc=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=benh@kernel.crashing.org \
    --cc=bjorn.topel@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=daniel@ffwll.ch \
    --cc=davem@davemloft.net \
    --cc=david@fromorbit.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hch@infradead.org \
    --cc=ira.weiny@intel.com \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=magnus.karlsson@intel.com \
    --cc=mchehab@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mike.kravetz@oracle.com \
    --cc=mpe@ellerman.id.au \
    --cc=netdev@vger.kernel.org \
    --cc=paulus@samba.org \
    --cc=shuah@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git