From: John Hubbard <jhubbard@nvidia.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Al Viro" <viro@zeniv.linux.org.uk>,
"Christoph Hellwig" <hch@infradead.org>,
"Dan Williams" <dan.j.williams@intel.com>,
"Dave Chinner" <david@fromorbit.com>,
"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Jonathan Corbet" <corbet@lwn.net>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Michal Hocko" <mhocko@suse.com>,
"Mike Kravetz" <mike.kravetz@oracle.com>,
"Shuah Khan" <shuah@kernel.org>,
"Vlastimil Babka" <vbabka@suse.cz>,
linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kselftest@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH v2 4/8] mm/gup: track FOLL_PIN pages
Date: Wed, 29 Jan 2020 22:44:50 -0800 [thread overview]
Message-ID: <0be743df-e9af-6da9-c593-9e25ab194acf@nvidia.com> (raw)
In-Reply-To: <20200129135153.knie7ptvsxcgube6@box>
On 1/29/20 5:51 AM, Kirill A. Shutemov wrote:
> On Tue, Jan 28, 2020 at 07:24:13PM -0800, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN. This tracking is
>> implemented via overloading of page->_refcount: pins are added by
>> adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a
>> fuzzy indication of pinning, and it can have false positives (and that's
>> OK). Please see the pre-existing
>> Documentation/core-api/pin_user_pages.rst for details.
>>
>> As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
>> (typically via pin_user_pages*()) are required to ultimately free such
>> pages via unpin_user_page().
>>
>> Please also not the limitation, discussed in pin_user_pages.rst under
>
> s/not/note/
Fixed, thanks!
...
>>
>> +/**
>> + * page_dma_pinned() - report if a page is pinned for DMA.
>> + *
>> + * This function checks if a page has been pinned via a call to
>> + * pin_user_pages*().
>> + *
>> + * For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
>> + * because it means "definitely not pinned for DMA", but true means "probably
>> + * pinned for DMA, but possibly a false positive due to having at least
>> + * GUP_PIN_COUNTING_BIAS worth of normal page references".
>> + *
>> + * False positives are OK, because: a) it's unlikely for a page to get that many
>> + * refcounts, and b) all the callers of this routine are expected to be able to
>> + * deal gracefully with a false positive.
>
> I wounder if we should reverse the logic and name -- page_not_dma_pinned()
> or something -- too emphasise that we can only know for sure when the page
> is not pinned, but not necessary when it is.
>
This is an interesting point. I agree that it's worth maybe adding information
into the function name, but I'd like to keep the bool "positive", because there
will be a number of callers that ask "if it is possibly dma-pinned, then ...".
So combining that, how about this function name:
page_maybe_dma_pinned()
, which I could live with and I think would be acceptable?
>> + *
>> + * For more information, please see Documentation/vm/pin_user_pages.rst.
>> + *
>> + * @page: pointer to page to be queried.
>> + * @Return: True, if it is likely that the page has been "dma-pinned".
>> + * False, if the page is definitely not dma-pinned.
>> + */
>> +static inline bool page_dma_pinned(struct page *page)
>> +{
>> + /*
>> + * page_ref_count() is signed. If that refcount overflows, then
>> + * page_ref_count() returns a negative value, and callers will avoid
>> + * further incrementing the refcount.
>> + *
>> + * Here, for that overflow case, use the signed bit to count a little
>> + * bit higher via unsigned math, and thus still get an accurate result
>> + * from page_dma_pinned().
>> + */
>> + return ((unsigned int)page_ref_count(compound_head(page))) >=
>> + GUP_PIN_COUNTING_BIAS;
>
> Do you expect it too be called on tail pages?
Yes, we definitely cannot rule that out.
>
>> +}
>> +
>> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>> #define SECTION_IN_PAGE_FLAGS
>> #endif
>> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
>> index 14d14beb1f7f..b9cbe553d1e7 100644
>> --- a/include/linux/page_ref.h
>> +++ b/include/linux/page_ref.h
>> @@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
>> __page_ref_mod(page, -nr);
>> }
>>
>> +static inline int page_ref_sub_return(struct page *page, int nr)
>> +{
>> + int ret = atomic_sub_return(nr, &page->_refcount);
>> +
>> + if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
>> + __page_ref_mod(page, -nr);
>> +
>> + return ret;
>> +}
>> +
>
> I see opportunity to split the patch further.
ah, OK. I wasn't sure how far to go before I get tagged for "excessive
patch splitting"! haha. Anyway, are you suggesting to put the
page_ref_sub_return() routine into it's own patch?
Another thing to split out would be adding the flags to the remaining
functions, such as undo_dev_pagemap(). That burns quite a few lines of
diff. Anything else to split out?
>
>> static inline void page_ref_inc(struct page *page)
>> {
>> atomic_inc(&page->_refcount);
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 9e117998274c..7a96490dcc54 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -44,6 +44,136 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>> return head;
>> }
>>
>> +/*
>> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
>> + * flags-dependent amount.
>> + *
>> + * "grab" names in this file mean, "look at flags to decide whether to use
>> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
>> + *
>> + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
>> + * same time. (That's true throughout the get_user_pages*() and
>> + * pin_user_pages*() APIs.) Cases:
>> + *
>> + * FOLL_GET: page's refcount will be incremented by 1.
>> + * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
>> + *
>> + * Return: head page (with refcount appropriately incremented) for success, or
>> + * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
>> + * considered failure, and furthermore, a likely bug in the caller, so a warning
>> + * is also emitted.
>> + */
>> +static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>> + int refs,
>> + unsigned int flags)
>> +{
>> + if (flags & FOLL_GET)
>> + return try_get_compound_head(page, refs);
>> + else if (flags & FOLL_PIN) {
>> + refs *= GUP_PIN_COUNTING_BIAS;
>> + return try_get_compound_head(page, refs);
>
> Maybe overflow detection? At least under VM_BUG_ON()?
OK, yes I see now that there is no check to see if we're about to overflow
the refs, in this path. I'll add one.
...
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0a55dec68925..b1079aaa6f24 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -958,6 +958,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>> */
>> WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
>>
>> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
>> + (FOLL_PIN | FOLL_GET)))
>
> Too many parentheses.
OK, I'll remove at least one. :)
>
>> + return NULL;
>> +
>> if (flags & FOLL_WRITE && !pmd_write(*pmd))
>> return NULL;
>>
>> @@ -973,7 +978,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>> * device mapped pages can only be returned if the
>> * caller will manage the page reference count.
>> */
>> - if (!(flags & FOLL_GET))
>> + if (!(flags & (FOLL_GET | FOLL_PIN)))
>> return ERR_PTR(-EEXIST);
>>
>> pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
>> @@ -981,7 +986,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>> if (!*pgmap)
>> return ERR_PTR(-EFAULT);
>> page = pfn_to_page(pfn);
>> - get_page(page);
>> + if (!try_grab_page(page, flags))
>> + page = ERR_PTR(-ENOMEM);
>>
>> return page;
>> }
>> @@ -1101,6 +1107,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>> if (flags & FOLL_WRITE && !pud_write(*pud))
>> return NULL;
>>
>> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
>> + (FOLL_PIN | FOLL_GET)))
>> + return NULL;
>> +
>
> Ditto.
ACK.
...
>> @@ -4965,6 +4958,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>> struct page *page = NULL;
>> spinlock_t *ptl;
>> pte_t pte;
>> +
>> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
>> + (FOLL_PIN | FOLL_GET)))
>> + return NULL;
>> +
>
> Ditto.
ACK.
thanks,
--
John Hubbard
NVIDIA
next prev parent reply other threads:[~2020-01-30 6:47 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-29 3:24 [PATCH v2 0/8] mm/gup: track FOLL_PIN pages (follow on from v12) John Hubbard
2020-01-29 3:24 ` [PATCH v2 1/8] mm: dump_page: print head page's refcount, for compound pages John Hubbard
2020-01-29 11:25 ` Kirill A. Shutemov
2020-01-29 22:26 ` John Hubbard
2020-01-29 22:59 ` Matthew Wilcox
2020-01-30 6:23 ` John Hubbard
2020-01-30 6:30 ` John Hubbard
2020-01-29 3:24 ` [PATCH v2 2/8] mm/gup: split get_user_pages_remote() into two routines John Hubbard
2020-01-29 3:24 ` [PATCH v2 3/8] mm/gup: pass a flags arg to __gup_device_* functions John Hubbard
2020-01-29 3:24 ` [PATCH v2 4/8] mm/gup: track FOLL_PIN pages John Hubbard
2020-01-29 13:51 ` Kirill A. Shutemov
2020-01-30 6:44 ` John Hubbard [this message]
2020-01-30 11:31 ` Kirill A. Shutemov
2020-01-31 3:19 ` John Hubbard
2020-01-29 3:24 ` [PATCH v2 5/8] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages John Hubbard
2020-01-29 3:24 ` [PATCH v2 6/8] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
2020-01-29 3:24 ` [PATCH v2 7/8] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2020-01-29 3:24 ` [PATCH v2 8/8] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0be743df-e9af-6da9-c593-9e25ab194acf@nvidia.com \
--to=jhubbard@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=corbet@lwn.net \
--cc=dan.j.williams@intel.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=jgg@ziepe.ca \
--cc=jglisse@redhat.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=kirill@shutemov.name \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-rdma@vger.kernel.org \
--cc=mhocko@suse.com \
--cc=mike.kravetz@oracle.com \
--cc=shuah@kernel.org \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).