From: Jan Kara <jack@suse.cz>
To: John Hubbard <jhubbard@nvidia.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Al Viro" <viro@zeniv.linux.org.uk>,
"Christoph Hellwig" <hch@infradead.org>,
"Dan Williams" <dan.j.williams@intel.com>,
"Dave Chinner" <david@fromorbit.com>,
"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Jonathan Corbet" <corbet@lwn.net>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Kirill A . Shutemov" <kirill@shutemov.name>,
"Michal Hocko" <mhocko@suse.com>,
"Mike Kravetz" <mike.kravetz@oracle.com>,
"Shuah Khan" <shuah@kernel.org>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Matthew Wilcox" <willy@infradead.org>,
linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kselftest@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 08/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
Date: Mon, 3 Feb 2020 15:43:24 +0100
Message-ID: <20200203144324.GF18591@quack2.suse.cz>
In-Reply-To: <20200201034029.4063170-9-jhubbard@nvidia.com>
On Fri 31-01-20 19:40:25, John Hubbard wrote:
> For huge pages (and in fact, any compound page), the
> GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily: each tail
> page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
> (1024). That limits the number of huge pages that can be pinned.
>
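To put rough numbers on the overflow described above, here is a minimal
sketch of the arithmetic. It assumes a signed 32-bit refcount on the head
page and that every base page of the compound page is pinned once per
"full pin". The helper name max_full_pins() is hypothetical and exists
only for illustration; it is not part of the patch.

```c
#include <stdint.h>

/* Value taken from the quoted patch description. */
#define GUP_PIN_COUNTING_BIAS 1024

/*
 * Hypothetical helper: how many times can every base page of one compound
 * page be pinned before a signed 32-bit head-page refcount would overflow,
 * if each pin of each base page adds GUP_PIN_COUNTING_BIAS to the head?
 */
static int64_t max_full_pins(uint64_t compound_bytes, uint64_t base_page_bytes)
{
	int64_t subpages = (int64_t)(compound_bytes / base_page_bytes);
	int64_t per_full_pin = subpages * GUP_PIN_COUNTING_BIAS;

	return (int64_t)INT32_MAX / per_full_pin;
}
```

Under these assumptions a 1GB huge page with 4 KB base pages allows only 7
full pins (in line with the "about 8 references" estimate in the removed
documentation text), while a 2MB huge page allows 4095.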
> This patch removes that limitation, by using an exact form of pin
> counting for compound pages of order > 1. The "order > 1" is required
> because this approach uses the 3rd struct page in the compound page, and
> order 1 compound pages only have two pages, so that won't work there.
>
> A new struct page field, hpage_pinned_refcount, has been added,
> replacing a padding field in the union (so no new space is used).
>
> This enhancement also has a useful side effect: huge pages and compound
> pages (of order > 1) do not suffer from the "potential false positives"
> problem that is discussed in the page_dma_pinned() comment block. That
> is because these compound pages have extra space for tracking things, so
> they get exact pin counts instead of overloading page->_refcount.
>
> Documentation/core-api/pin_user_pages.rst is updated accordingly.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> Documentation/core-api/pin_user_pages.rst | 40 +++++-------
> include/linux/mm.h | 26 ++++++++
> include/linux/mm_types.h | 7 +-
> mm/gup.c | 78 ++++++++++++++++++++---
> mm/hugetlb.c | 6 ++
> mm/page_alloc.c | 2 +
> mm/rmap.c | 6 ++
> 7 files changed, 133 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9829345428f8..3f72b1ea1104 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -52,8 +52,22 @@ Which flags are set by each wrapper
>
> For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
> flags the caller provides. The caller is required to pass in a non-null struct
> -pages* array, and the function then pin pages by incrementing each by a special
> -value. For now, that value is +1, just like get_user_pages*().::
> +pages* array, and the function then pins pages by incrementing each by a special
> +value: GUP_PIN_COUNTING_BIAS.
> +
> +For huge pages (and in fact, any compound page of more than 2 pages), the
> +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
> +is achieved, by using the 3rd struct page in the compound page. A new struct
> +page field, hpage_pinned_refcount, has been added in order to support this.
> +
> +This approach for compound pages avoids the counting upper limit problems that
> +are discussed below. Those limitations would have been aggravated severely by
> +huge pages, because each tail page adds a refcount to the head page. And in
> +fact, testing revealed that, without a separate hpage_pinned_refcount field,
> +page overflows were seen in some huge page stress tests.
> +
> +This also means that huge pages and compound pages (of order > 1) do not suffer
> +from the false positives problem that is mentioned below.::
>
> Function
> --------
> @@ -99,27 +113,6 @@ pages:
> This also leads to limitations: there are only 31-10==21 bits available for a
> counter that increments 10 bits at a time.
>
> -TODO: for 1GB and larger huge pages, this is cutting it close. That's because
> -when pin_user_pages() follows such pages, it increments the head page by "1"
> -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
> -pin_user_pages()) for each tail page. So if you have a 1GB huge page:
> -
> -* There are 256K (18 bits) worth of 4 KB tail pages.
> -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
> - 10 bits at a time)
> -* There are 21 - 18 == 3 bits available to count. Except that there aren't,
> - because you need to allow for a few normal get_page() calls on the head page,
> - as well. Fortunately, the approach of using addition, rather than "hard"
> - bitfields, within page->_refcount, allows for sharing these bits gracefully.
> - But we're still looking at about 8 references.
> -
> -This, however, is a missing feature more than anything else, because it's easily
> -solved by addressing an obvious inefficiency in the original get_user_pages()
> -approach of retrieving pages: stop treating all the pages as if they were
> -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
> -this, so some work is required. Once that's in place, this limitation mostly
> -disappears from view, because there will be ample refcounting range available.
> -
> * Callers must specifically request "dma-pinned tracking of pages". In other
> words, just calling get_user_pages() will not suffice; a new set of functions,
> pin_user_page() and related, must be used.
> @@ -228,5 +221,6 @@ References
> * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
> * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
> * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
> +* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
>
> John Hubbard, October, 2019
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca787c606f0e..fdcd137b9981 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
> return page[1].compound_order;
> }
>
> +static inline bool hpage_pincount_available(struct page *page)
> +{
> + /*
> + * Can the page->hpage_pinned_refcount field be used? That field is in
> + * the 3rd page of the compound page, so the smallest (2-page) compound
> + * pages cannot support it.
> + */
> + page = compound_head(page);
> + return PageCompound(page) && compound_order(page) > 1;
> +}
> +
> +static inline int compound_pincount(struct page *page)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + page = compound_head(page);
> + return atomic_read(compound_pincount_ptr(page));
> +}
> +
> static inline void set_compound_order(struct page *page, unsigned int order)
> {
> page[1].compound_order = order;
> @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
> * refcounts, and b) all the callers of this routine are expected to be able to
> * deal gracefully with a false positive.
> *
> + * For huge pages, the result will be exactly correct. That's because we have
> + * more tracking data available: the 3rd struct page in the compound page is
> + * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
> + * scheme).
> + *
> * For more information, please see Documentation/vm/pin_user_pages.rst.
> *
> * @page: pointer to page to be queried.
> @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
> */
> static inline bool page_maybe_dma_pinned(struct page *page)
> {
> + if (hpage_pincount_available(page))
> + return compound_pincount(page) > 0;
> +
> /*
> * page_ref_count() is signed. If that refcount overflows, then
> * page_ref_count() returns a negative value, and callers will avoid
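The difference between the two checks can be illustrated with a small
userspace model. This is only a sketch: struct model_page and the model_*()
functions are hypothetical stand-ins for struct page and the helpers in the
hunk above, and the fallback comparison against GUP_PIN_COUNTING_BIAS is my
reading of the bias scheme rather than something quoted in this hunk.

```c
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical userspace stand-in for the fields involved. */
struct model_page {
	int refcount;		/* models page->_refcount */
	int pincount;		/* models hpage_pinned_refcount */
	unsigned int order;	/* models compound_order() */
};

/* Models hpage_pincount_available(): needs a 3rd struct page. */
static bool model_pincount_available(const struct model_page *p)
{
	return p->order > 1;
}

/* Models page_maybe_dma_pinned() after this patch. */
static bool model_maybe_dma_pinned(const struct model_page *p)
{
	if (model_pincount_available(p))
		return p->pincount > 0;		/* exact answer */

	/* Bias scheme: many plain references look like a pin. */
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model, an order-0 page that has accumulated 1024 ordinary references
is reported as pinned (a false positive), whereas an order-9 page with the
same refcount and a zero pincount is not.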
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e87bb864bdb2..01e9717b8529 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -137,7 +137,7 @@ struct page {
> };
> struct { /* Second tail page of compound page */
> unsigned long _compound_pad_1; /* compound_head */
> - unsigned long _compound_pad_2;
> + atomic_t hpage_pinned_refcount;
> /* For both global and memcg */
> struct list_head deferred_list;
> };
> @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
> return &page[1].compound_mapcount;
> }
>
> +static inline atomic_t *compound_pincount_ptr(struct page *page)
> +{
> + return &page[2].hpage_pinned_refcount;
> +}
> +
> /*
> * Used for sizing the vmemmap region on some architectures
> */
> diff --git a/mm/gup.c b/mm/gup.c
> index 6e8b773c233a..c10d0d051c5b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,22 @@ struct follow_page_context {
> unsigned int page_mask;
> };
>
> +static void hpage_pincount_add(struct page *page, int refs)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> + atomic_add(refs, compound_pincount_ptr(page));
> +}
> +
> +static void hpage_pincount_sub(struct page *page, int refs)
> +{
> + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> + VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> + atomic_sub(refs, compound_pincount_ptr(page));
> +}
> +
> /*
> * Return the compound head page with ref appropriately incremented,
> * or NULL if that failed.
> @@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> if (flags & FOLL_GET)
> return try_get_compound_head(page, refs);
> else if (flags & FOLL_PIN) {
> - refs *= GUP_PIN_COUNTING_BIAS;
> - return try_get_compound_head(page, refs);
> + /*
> + * When pinning a compound page of order > 1 (which is what
> + * hpage_pincount_available() checks for), use an exact count to
> + * track it, via hpage_pincount_add/_sub().
> + *
> + * However, be sure to *also* increment the normal page refcount
> + * field at least once, so that the page really is pinned.
> + */
> + if (!hpage_pincount_available(page))
> + refs *= GUP_PIN_COUNTING_BIAS;
> +
> + page = try_get_compound_head(page, refs);
> + if (!page)
> + return NULL;
> +
> + if (hpage_pincount_available(page))
> + hpage_pincount_add(page, refs);
> +
> + return page;
> }
>
> WARN_ON_ONCE(1);
> @@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
> if (flags & FOLL_GET)
> return try_get_page(page);
> else if (flags & FOLL_PIN) {
> + int refs = 1;
> +
> page = compound_head(page);
>
> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> return false;
>
> - page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> + if (hpage_pincount_available(page))
> + hpage_pincount_add(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + /*
> + * Similar to try_grab_compound_head(): even if using the
> + * hpage_pincount_add/_sub() routines, be sure to
> + * *also* increment the normal page refcount field at least
> + * once, so that the page really is pinned.
> + */
> + page_ref_add(page, refs);
> }
>
> return true;
> @@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
> #ifdef CONFIG_DEV_PAGEMAP_OPS
> static bool __unpin_devmap_managed_user_page(struct page *page)
> {
> - int count;
> + int count, refs = 1;
>
> if (!page_is_devmap_managed(page))
> return false;
>
> - count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + count = page_ref_sub_return(page, refs);
>
> /*
> * devmap page refcounts are 1-based, rather than 0-based: if
> @@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
> */
> void unpin_user_page(struct page *page)
> {
> + int refs = 1;
> +
> page = compound_head(page);
>
> /*
> @@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
> if (__unpin_devmap_managed_user_page(page))
> return;
>
> - if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, 1);
> + else
> + refs = GUP_PIN_COUNTING_BIAS;
> +
> + if (page_ref_sub_and_test(page, refs))
> __put_page(page);
> }
> EXPORT_SYMBOL(unpin_user_page);
> @@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
>
> static void put_compound_head(struct page *page, int refs, unsigned int flags)
> {
> - if (flags & FOLL_PIN)
> - refs *= GUP_PIN_COUNTING_BIAS;
> + if (flags & FOLL_PIN) {
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, refs);
> + else
> + refs *= GUP_PIN_COUNTING_BIAS;
> + }
>
> VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> /*
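The pin and unpin paths above have to stay symmetric, which is easy to check
in a small userspace model. Again, this is only a sketch under assumptions:
struct model_page, model_pin() and model_unpin() are hypothetical names, the
counters are plain ints rather than atomics, and the failure path of
try_get_compound_head() is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical userspace stand-in for the fields involved. */
struct model_page {
	int refcount;		/* models page->_refcount */
	int pincount;		/* models hpage_pinned_refcount */
	unsigned int order;	/* models compound_order() */
};

static bool model_pincount_available(const struct model_page *p)
{
	return p->order > 1;
}

/* Models the FOLL_PIN branch of try_grab_compound_head(). */
static void model_pin(struct model_page *p, int refs)
{
	if (model_pincount_available(p)) {
		p->refcount += refs;	/* page must still be referenced */
		p->pincount += refs;	/* exact pin count */
	} else {
		p->refcount += refs * GUP_PIN_COUNTING_BIAS;
	}
}

/* Models the FOLL_PIN branch of put_compound_head(). */
static void model_unpin(struct model_page *p, int refs)
{
	if (model_pincount_available(p)) {
		assert(p->pincount >= refs);
		p->pincount -= refs;
	} else {
		refs *= GUP_PIN_COUNTING_BIAS;
	}

	assert(p->refcount >= refs);
	p->refcount -= refs;
}
```

Pinning all 512 base pages of an order-9 page in this model raises the
refcount by 512 rather than 512 * 1024, and unpinning the same number
returns both counters to their starting values.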
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 487e998fd38e..07059d936f7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
> struct page *p = page + 1;
>
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> clear_compound_head(p);
> set_page_refcounted(p);
> @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
> +
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> }
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15e908ad933b..c205b912f108 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> }
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b3e381919835..e45b9b991e2f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
> VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> /* increment count (starts at -1) */
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> __inc_node_page_state(page, NR_ANON_THPS);
> } else {
> /* Anon THP always mapped first with PMD */
> @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
> {
> BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> atomic_set(compound_mapcount_ptr(page), 0);
> + if (hpage_pincount_available(page))
> + atomic_set(compound_pincount_ptr(page), 0);
> +
> __page_set_anon_rmap(page, vma, address, 1);
> }
> #endif /* CONFIG_HUGETLB_PAGE */
> --
> 2.25.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR