From: Joao Martins <joao.m.martins@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Linux MM <linux-mm@kvack.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Muchun Song <songmuchun@bytedance.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
Date: Mon, 22 Feb 2021 14:32:04 +0000
Message-ID: <872eec38-3c18-72dd-c5c6-147c02ae33d1@oracle.com>
In-Reply-To: <ca888299-c576-567b-e6c3-5df3bcd8ca51@oracle.com>



On 2/22/21 11:06 AM, Joao Martins wrote:
> On 2/20/21 1:18 AM, Dan Williams wrote:
>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> The link above describes it quite nicely, but the idea is to reuse tail
>>> page vmemmap areas, particularly the area which only describes tail pages.
>>> So a vmemmap page describes 64 struct pages, and the first page for a given
>>> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
>>> vmemmap page would contain only tail pages, and that's what gets reused across
>>> the rest of the subsection/section. The bigger the page size, the bigger the
>>> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>>>
>>> In terms of savings, per 1TB of memory, the struct page cost would go down
>>> with compound pagemap:
>>>
>>> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
>>> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
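The savings figures above follow from simple arithmetic, sketched below in Python. This is only an illustration, not code from the series: it assumes 4K base pages and a 64-byte struct page (consistent with "a vmemmap page describes 64 struct pages"), and the helper names are made up for this example.

```python
# Illustrative arithmetic only; the constants are assumptions (x86-64 style
# 4K base pages, 64-byte struct page) and the helper names are hypothetical.
PAGE_SIZE = 4096          # base page size in bytes
STRUCT_PAGE_SIZE = 64     # assumed sizeof(struct page) in bytes
KEPT_VMEMMAP_PAGES = 2    # head + 63 tails page, plus one reusable tail-only page

MiB, GiB, TiB = 1 << 20, 1 << 30, 1 << 40


def vmemmap_pages(hpage_size):
    """vmemmap pages needed to describe one huge page without any reuse."""
    struct_pages = hpage_size // PAGE_SIZE
    return struct_pages * STRUCT_PAGE_SIZE // PAGE_SIZE


def saved_vmemmap_pages(hpage_size):
    """vmemmap pages saved per huge page when tail-only pages are reused."""
    return vmemmap_pages(hpage_size) - KEPT_VMEMMAP_PAGES


def struct_page_cost(total_mem, hpage_size=None):
    """Bytes of struct pages for total_mem; hpage_size=None means no reuse."""
    if hpage_size is None:
        return total_mem // PAGE_SIZE * STRUCT_PAGE_SIZE
    return total_mem // hpage_size * KEPT_VMEMMAP_PAGES * PAGE_SIZE


if __name__ == "__main__":
    for name, size in (("2M", 2 * MiB), ("1G", GiB)):
        cost = struct_page_cost(TiB, size)
        print(f"{name}: saves {saved_vmemmap_pages(size)} vmemmap pages per hpage, "
              f"{cost / MiB:.0f} MiB of struct pages per 1TB "
              f"({100 * cost / TiB:.4f}% of memory)")
```

With these assumptions the helpers reproduce the 6 and 4094 saved vmemmap pages, and the 4G and 8MB per-1TB costs quoted above.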
>>
>> Nice!
>>
> 
> I failed to mention this in the cover letter but I should say that with this trick we will
> need to build the vmemmap page tables with basepages for 2M align, as opposed to hugepages
> in the vmemmap page tables (as you probably could tell from the patches). 

Also to be clear: by "we will need to build the vmemmap page tables with basepages for 2M
align" I strictly refer to the ZONE_DEVICE range whose struct pages we are mapping. It's not
the entirety of the vmemmap!

> This means that
> we have to allocate a PMD page, and that costs 2GB per 1TB (as opposed to 4M). This is
> fixable for 1G align by reusing PMD pages (albeit I haven't done that in this RFC series).
> 
> The footprint reduction is still big, so to reiterate the numbers above (and I will fix this
> in the v2 cover letter):
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 
> For vmemmap page tables, we need to use base pages for 2M pages. So taking that into
> account, in this RFC series:
> 
> * with 2M pages we lose 6G instead of 16G (0.586% instead of 1.5% of total memory)
> * with 1G pages we lose ~2GB instead of 16G (0.19% instead of 1.5% of total memory)
> 
> For 1G align, we are able to reuse vmemmap PMDs that only point to tail pages, so
> ultimately we can get the page table overhead from 2GB to 12MB:
> 
> * with 1G pages we lose 20MB instead of 16G (0.0019% instead of 1.5% of total memory)
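As a quick check of how the totals above combine, here is a small Python sketch. It does not derive the page table costs; it simply takes the struct page and page table figures stated in this message (4G/8MB and 2GB/12MB per 1TB) as inputs and adds them up. The scenario labels and function names are made up for the example.

```python
# Illustrative only: inputs are the per-1TB figures quoted in this message,
# not values computed from the kernel. Names are hypothetical.
MiB, GiB, TiB = 1 << 20, 1 << 30, 1 << 40

# scenario -> (struct page cost, vmemmap page table cost), both per 1TB
SCENARIOS = {
    "2M align": (4 * GiB, 2 * GiB),
    "1G align, no PMD reuse": (8 * MiB, 2 * GiB),
    "1G align, PMD reuse": (8 * MiB, 12 * MiB),
}


def total_overhead(scenario):
    """Struct page cost plus page table cost, in bytes per 1TB."""
    struct_pages, page_tables = SCENARIOS[scenario]
    return struct_pages + page_tables


def percent_of_memory(scenario, total_mem=TiB):
    """Total overhead as a percentage of total_mem."""
    return 100.0 * total_overhead(scenario) / total_mem


if __name__ == "__main__":
    for name in SCENARIOS:
        print(f"{name}: {total_overhead(name) / MiB:.0f} MiB "
              f"({percent_of_memory(name):.4f}% of 1TB)")
```

The 2M case sums to 6G and the 1G case with PMD reuse sums to 20MB, matching the bullets above; the 1G case without reuse is dominated by the 2GB of page tables.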
> 
>>>
>>> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
>>> user. For unpin_user_pages() we have an additional test to demonstrate the
>>> improvement. The test performs MR reg/unreg continuously and measures its
>>> rate for a given period. So essentially ib_mem_get and ib_mem_release are being
>>> stress tested, which at the end of the day means pin_user_pages_longterm() and
>>> unpin_user_pages() for a scatterlist:
>>>
>>>     Before:
>>>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>>>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>>>
>>>     After:
>>>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>>>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
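To make the before/after comparison easier to read, the speedup can be computed from the reported usec/round values. The Python sketch below is only illustrative; the dictionary layout and function name are made up, and the inputs are copied from the results above.

```python
# Illustrative only: per-round latencies (usec) copied from the results above.
BEFORE = {"device-dax": 31617.923, "hugetlbfs": 10748.456}
AFTER = {"device-dax": 16426.047, "hugetlbfs": 4663.622}


def speedup(backend):
    """How many times faster one MR reg/unreg round is after the change."""
    return BEFORE[backend] / AFTER[backend]


if __name__ == "__main__":
    for backend in BEFORE:
        print(f"{backend}: {speedup(backend):.2f}x faster per round")
```

Both backends speed up by roughly 2x, since the unpinning path is shared between them.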
>>
>> Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
>> that question myself when I get to patch 8.
>>
> Because the unpinning improvements aren't ZONE_DEVICE specific.
> 
> FWIW, I moved those two offending patches outside of this series:
> 
>   https://lore.kernel.org/linux-mm/20210212130843.13865-1-joao.m.martins@oracle.com/
> 
>>>
>>> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
>>> is very experimental, and I imported most of follow_hugetlb_page(), except
>>> that we do the same trick as gup-fast. In doing the patch I feel this batching
>>> should live in follow_page_mask(), with that function changed to return a set
>>> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
>>> patch then brings the previous test of MR reg/unreg (above) to parity
>>> between device-dax and hugetlbfs.
>>>
>>> Some of the patches are a little fresh/WIP (especially patches 3 and 9) and we are
>>> still running tests. Hence the RFC, asking for comments and general direction
>>> of the work before continuing.
>>
>> Will go look at the code, but I don't see anything scary conceptually
>> here. The fact that pfn_to_page() does not need to change is among the
>> most compelling features of this approach.
>>
> Glad to hear that :D
> 