nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Joao Martins <joao.m.martins@oracle.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-mm@kvack.org, Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Matthew Wilcox <willy@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>, Christoph Hellwig <hch@lst.de>,
	nvdimm@lists.linux.dev, linux-doc@vger.kernel.org
Subject: Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
Date: Thu, 14 Oct 2021 18:56:51 +0100	[thread overview]
Message-ID: <9a9dccc8-81b0-f3dd-60ea-130406791e64@oracle.com> (raw)
In-Reply-To: <20211013194345.GO2744544@nvidia.com>

On 10/13/21 20:43, Jason Gunthorpe wrote:
> On Wed, Oct 13, 2021 at 08:18:08PM +0100, Joao Martins wrote:
>> On 10/13/21 18:41, Jason Gunthorpe wrote:
>>> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>>>> On 10/8/21 12:54, Jason Gunthorpe wrote:
>>>>> The only optimization that might work here is to grab the head, then
>>>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>>>> the head also secures the tails.
>>>> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
>>>> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
>>>> as an added @head argument. While keeping the same structure of counting tail pages
>>>> between @addr .. @end if we have a head page.
>>> The right logic is what everything else does:
>>> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>> 	refs = record_subpages(page, addr, end, pages + *nr);
>>> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
>>> If you can use this, or not, depends entirely on answering the
>>> question of why does  __gup_device_huge() exist at all.
>> So for device-dax it seems to be an untackled oversight[*], probably
>> inherited from when fsdax devmap was introduced. What I don't know
>> is the other devmap users :(
> devmap generic infrastructure waits until all page refcounts go to
> zero, and it should wait until any fast GUP is serialized as part of
> the TLB shootdown - otherwise it is leaking access to the memory it
> controls beyond it's shutdown
> So, I don't think the different devmap users can break this?
maybe fsdax may not honor that after removing the pgmap ref count
as we talked earlier in the thread without the memory failures stuff.

But my point earlier on "oversight" was about the fact that we describe
PMD/PUDs with base pages. And hence why we can't do this:

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 	head = try_grab_compound_head(pud_page(orig), refs, flags);

... For devmap.

>>> This I don't fully know:
>>> 1) As discussed quite a few times now, the entire get_dev_pagemap
>>>    stuff looks usless and should just be deleted. If you care about
>>>    optimizing this I would persue doing that as it will give the
>>>    biggest single win.
>> I am not questioning the well-deserved improvement -- but from a pure
>> optimization perspective the get_dev_pagemap() cost is not
>> visible and quite negligeble.
> You are doing large enough GUPs then that the expensive xarray seach
> is small compared to the rest?

The tests I showed are on 16G and 128G (I usually test with > 1Tb).
But we are comparing 1 pgmap lookup, and 512 refcount atomic updates for
a PMD sized pin (2M).

It depends on what you think small is. Maybe for a 4K-16K of memory pinned,
probably the pgmap lookup is more expensive.

A couple months ago I remember measuring xarray lookups and that would be
around ~150ns per lookup with few entries excluding the ref update. I can
be pedantic and give you a more concrete measurement for get_dev_pagemap()
but quite honestly don't think that it will be close with we pin a hugepage.

But this is orthogonal to the pgmap ref existence, I was merely commenting on
the performance aspect.

>>> 2) It breaks up the PUD/PMD into tail pages and scans them all
>>>    Why? Can devmap combine multiple compound_head's into the same PTE?
>> I am not aware of any other usage of compound pages for devmap struct pages
>> than this series. At least I haven't seen device-dax or fsdax using
>> this.
> Let me ask this question differently, is this assertion OK?
I understood the question. I was more trying to learn from you  on HMM/P2PDMA
use-cases as you sounded like compound pages exist elsewhere in devmap :)

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -808,8 +808,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>         }
>         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> -       if (pfn_t_devmap(pfn))
> +       if (pfn_t_devmap(pfn)) {
> +               struct page *pfn_to_page(pfn);
> +
> +               WARN_ON(compound_head(page) != page);
> +               WARN_ON(compound_order(page) != PMD_SHIFT);
>                 entry = pmd_mkdevmap(entry);
> +       }
>         if (write) {
>                 entry = pmd_mkyoung(pmd_mkdirty(entry));
>                 entry = maybe_pmd_mkwrite(entry, vma);
> (and same for insert_pfn_pud)
> You said it is OK for device/dax/device.c?
Yes, correct. (I also verified with similar snippet above to be

I would like to emphasize that it is only OK *after this series*.

Without this series there is neither compound order or head, nor any
compound page whatsoever.

> And not for fs/dax.c?
IIUC, Correct. fs/dax code is a little fuzzy on the sector to PFN
translation but still:

There's no compound pages in fsdax (neither do I add them). So
compound_order() doesn't exist (hence order 0) and there's no head
which also means that compound_head(@page) returns @page (which is
a tad misleading on being a head definition as a PMD subpage would
not return @page).

>> Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA
>> doesn't seem to be (yet?) caught by any of the GUP path at least before
>> Logan's series lands. Or am I misunderstanding things here?
> Of the places that call the insert_pfn_pmd/pud call chains I only see
> device/dax/device.c and fs/dax.c as being linked to devmap. So other
> devmap users don't use this stuff.
>> I was changing __gup_device_huge() with similar to the above, but yeah
>> it follows that algorithm as inside your do { } while() (thanks!). I can
>> turn __gup_device_huge() into another (renamed to like try_grab_pages())
>> helper and replace the callsites of gup_huge_{pud,pmd} for the THP/hugetlbfs
>> equivalent handling.
> I suppose it should be some #define because the (pmd_val != orig) logic
> is not sharable
I wasn't going to have that pmd_val() in there. Just this refcount part:

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 	head = try_grab_compound_head(pud_page(orig), refs, flags);

and __gup_device_huge() melded in.

> But, yes, a general call that the walker should make at any level to
> record a pfn -> npages range efficiently.
>> I think the right answer is "depends on the devmap" type. device-dax with
>> PMD/PUDs (i.e. 2m pagesize PMEM or 1G pagesize pmem) works with the same
>> rules as hugetlbfs. fsdax not so much (as you say above) but it would
>> follow up changes to perhaps adhere to similar scheme (not exactly sure
>> how do deal with holes). HMM I am not sure what the rules are there.
>> P2PDMA seems not applicable?
> I would say, not "depends on the devmap", but what are the rules for
> calling insert_pfn_pmd in the first place.
I am gonna translate that into HMM and P2PDMA so far aren't affected as you
also said earlier above. Even after Logan's P2PDMA NVME series it appears is
always PTEs. So fsdax and device-dax are the only pmd_devmap users we
should care, and for pud_devmap() only device-dax.

> If users are allowed the create pmds that span many compound_head's
> then we need logic as I showed. Otherwise we do not.
/me nods.

> And I would document this relationship in the GUP side "This do/while
> is required because insert_pfn_pmd/pud() is used with compound pages
> smaller than the PUD/PMD size" so it isn't so confused with just
> "devmap"

Also, it's not that PMDs span compound heads, it's that PMDs/PUDs use
just base pages. Compound pages/head in devmap are only introduced by
series and for device-dax only.

  reply	other threads:[~2021-10-14 17:57 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-08-27 14:58 ` [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-08-27 14:58 ` [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-08-27 15:33   ` Christoph Hellwig
2021-08-27 16:00     ` Joao Martins
2021-09-01  9:44       ` Christoph Hellwig
2021-09-09  9:38         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-11-05  0:31   ` Dan Williams
2021-11-05 12:09     ` Joao Martins
2021-11-05 16:14       ` Joao Martins
2021-11-05 16:46       ` Dan Williams
2021-11-05 18:11         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
2021-11-05  0:38   ` Dan Williams
2021-11-05 14:10     ` Joao Martins
2021-11-05 16:41       ` Dan Williams
2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-08-27 16:25   ` Jason Gunthorpe
2021-08-27 18:34     ` Joao Martins
2021-08-30 13:07       ` Jason Gunthorpe
2021-08-31 12:34         ` Joao Martins
2021-08-31 17:05           ` Jason Gunthorpe
2021-09-23 16:51             ` Joao Martins
2021-09-28 18:01               ` Jason Gunthorpe
2021-09-29 11:50                 ` Joao Martins
2021-09-29 19:34                   ` Jason Gunthorpe
2021-09-30  3:01                     ` Alistair Popple
2021-09-30 17:54                       ` Joao Martins
2021-09-30 21:55                         ` Jason Gunthorpe
2021-10-18 18:36                       ` Jason Gunthorpe
2021-10-18 18:37                   ` Jason Gunthorpe
2021-10-08 11:54   ` Jason Gunthorpe
2021-10-11 15:53     ` Joao Martins
2021-10-13 17:41       ` Jason Gunthorpe
2021-10-13 19:18         ` Joao Martins
2021-10-13 19:43           ` Jason Gunthorpe
2021-10-14 17:56             ` Joao Martins [this message]
2021-10-14 18:06               ` Jason Gunthorpe
2021-08-27 14:58 ` [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-08-27 14:58 ` [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-08-27 14:58 ` [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-08-27 14:58 ` [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps Joao Martins
2021-08-27 14:58 ` [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-08-27 14:58 ` [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9a9dccc8-81b0-f3dd-60ea-130406791e64@oracle.com \
    --to=joao.m.martins@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=hch@lst.de \
    --cc=jane.chu@oracle.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=songmuchun@bytedance.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).