All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joao Martins <joao.m.martins@oracle.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-mm@kvack.org, Dan Williams <dan.j.williams@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Matthew Wilcox <willy@infradead.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>, Christoph Hellwig <hch@lst.de>,
	nvdimm@lists.linux.dev, linux-doc@vger.kernel.org
Subject: Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
Date: Wed, 13 Oct 2021 20:18:08 +0100	[thread overview]
Message-ID: <20e2472f-736a-be3a-6f3a-5dfcb85f4cfb@oracle.com> (raw)
In-Reply-To: <20211013174140.GJ2744544@nvidia.com>

On 10/13/21 18:41, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>> On 10/8/21 12:54, Jason Gunthorpe wrote:
> 
>>> The only optimization that might work here is to grab the head, then
>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>> the head also secures the tails.
>>
>> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
>> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
>> as an added @head argument. While keeping the same structure of counting tail pages
>> between @addr .. @end if we have a head page.
> 
> The right logic is what everything else does:
> 
> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> 	refs = record_subpages(page, addr, end, pages + *nr);
> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
> 
> If you can use this, or not, depends entirely on answering the
> question of why does  __gup_device_huge() exist at all.
> 
So for device-dax it seems to be an untackled oversight[*], probably
inherited from when fsdax devmap was introduced. What I don't know
is the other devmap users :(

[*] it has all the same properties as hugetlbfs AFAIU (after this series)

Certainly if any devmap PMD/PUD was represented in a single compound
page like THP/hugetlbfs then this patch would be a matter of removing
pgmap ref grab (and nuke the __gup_device_huge function existence as I
suggested earlier).

> This I don't fully know:
> 
> 1) As discussed quite a few times now, the entire get_dev_pagemap
>    stuff looks usless and should just be deleted. If you care about
>    optimizing this I would persue doing that as it will give the
>    biggest single win.
> 
I am not questioning the well-deserved improvement -- but from a pure
optimization perspective the get_dev_pagemap() cost is not
visible and quite negligeble. It is done once and only once and
subsequent calls to get_dev_pagemap with a non-NULL pgmap don't alter
the refcount and just return the pgmap object. And the xarray storing
the ranges -> pgmap won't be that big ... perhaps maybe 12 pgmaps on
a large >1T pmem system depending on your DIMM size.

The refcount update of the individual 4K page is what introduces
a seriously prohibite cost: I am seeing 10x the cost with DRAM
located struct pages (pmem located struct pages is even more ludicrous).

> 2) It breaks up the PUD/PMD into tail pages and scans them all
>    Why? Can devmap combine multiple compound_head's into the same PTE?


I am not aware of any other usage of compound pages for devmap struct pages
than this series. At least I haven't seen device-dax or fsdax using this.
Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA
doesn't seem to be (yet?) caught by any of the GUP path at least before
Logan's series lands. Or am I misunderstanding things here?

>    Does devmap guarentee that the PUD/PMD points to the head page? (I
>    assume no)
> 
For device-dax yes.


> But I'm looking at this some more and I see try_get_compound_head() is
> reading the compound_head with no locking, just READ_ONCE, so that
> must be OK under GUP.
> 
I suppose one other way to get around the double atomic op would be to fail
the try_get_compound_head() by comparing the first tail page compound_head()
after grabbing the head with @refs. That is instead of comparing against
grabbed head page and the passed page argument.

> It still seems to me the same generic algorithm should work
> everywhere, once we get rid of the get_dev_pagemap
> 
>   start_pfn = pud/pmd_pfn() + pud/pmd_page_offset(addr)
>   end_pfn = start_pfn + (end - addr) // fixme
>   if (THP)
>      refs = end_pfn - start_pfn
>   if (devmap)
>      refs = 1
> 
>   do {
>      page = pfn_to_page(start_pfn)
>      head_page = try_grab_compound_head(page, refs, flags)
>      if (pud/pmd_val() != orig)
>         err
> 
>      npages = 1 << compound_order(head_page)
>      npages = min(npages, end_pfn - start_pfn)
>      for (i = 0, iter = page; i != npages) {
>      	 *pages++ = iter;
>          mem_map_next(iter, page, i)
>      }
> 
>      if (devmap && npages > 2)
>          grab_compound_head(head_page, npages - 1, flags)
>      start_pfn += npages;
>   } while (start_pfn != end_pfn)
> 
> Above needs to be cleaned up quite a bit, but is the general idea.
> 
> There is an further optimization we can put in where we can know that
> 'page' is still in a currently grab'd compound (eg last_page+1 = page,
> not past compound_order) and defer the refcount work.
> 
I was changing __gup_device_huge() with similar to the above, but yeah
it follows that algorithm as inside your do { } while() (thanks!). I can
turn __gup_device_huge() into another (renamed to like try_grab_pages())
helper and replace the callsites of gup_huge_{pud,pmd} for the THP/hugetlbfs
equivalent handling.

>> It's interesting how THP (in gup_huge_pmd()) unilaterally computes
>> tails assuming pmd_page(orig) is the head page.
> 
> I think this is an integral property of THP, probably not devmap/dax
> though?

I think the right answer is "depends on the devmap" type. device-dax with
PMD/PUDs (i.e. 2m pagesize PMEM or 1G pagesize pmem) works with the same
rules as hugetlbfs. fsdax not so much (as you say above) but it would
follow up changes to perhaps adhere to similar scheme (not exactly sure
how do deal with holes). HMM I am not sure what the rules are there.
P2PDMA seems not applicable?

  reply	other threads:[~2021-10-13 19:19 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-08-27 14:58 ` [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-08-27 14:58 ` [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-08-27 15:33   ` Christoph Hellwig
2021-08-27 16:00     ` Joao Martins
2021-09-01  9:44       ` Christoph Hellwig
2021-09-09  9:38         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-11-05  0:31   ` Dan Williams
2021-11-05 12:09     ` Joao Martins
2021-11-05 16:14       ` Joao Martins
2021-11-05 16:46       ` Dan Williams
2021-11-05 18:11         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
2021-11-05  0:38   ` Dan Williams
2021-11-05 14:10     ` Joao Martins
2021-11-05 16:41       ` Dan Williams
2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-08-27 16:25   ` Jason Gunthorpe
2021-08-27 18:34     ` Joao Martins
2021-08-30 13:07       ` Jason Gunthorpe
2021-08-31 12:34         ` Joao Martins
2021-08-31 17:05           ` Jason Gunthorpe
2021-09-23 16:51             ` Joao Martins
2021-09-28 18:01               ` Jason Gunthorpe
2021-09-29 11:50                 ` Joao Martins
2021-09-29 19:34                   ` Jason Gunthorpe
2021-09-30  3:01                     ` Alistair Popple
2021-09-30 17:54                       ` Joao Martins
2021-09-30 21:55                         ` Jason Gunthorpe
2021-10-18 18:36                       ` Jason Gunthorpe
2021-10-18 18:37                   ` Jason Gunthorpe
2021-10-08 11:54   ` Jason Gunthorpe
2021-10-11 15:53     ` Joao Martins
2021-10-13 17:41       ` Jason Gunthorpe
2021-10-13 19:18         ` Joao Martins [this message]
2021-10-13 19:43           ` Jason Gunthorpe
2021-10-14 17:56             ` Joao Martins
2021-10-14 18:06               ` Jason Gunthorpe
2021-08-27 14:58 ` [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-08-27 14:58 ` [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-08-27 14:58 ` [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-08-27 14:58 ` [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps Joao Martins
2021-08-27 14:58 ` [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-08-27 14:58 ` [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20e2472f-736a-be3a-6f3a-5dfcb85f4cfb@oracle.com \
    --to=joao.m.martins@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=hch@lst.de \
    --cc=jane.chu@oracle.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=songmuchun@bytedance.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.