From: Joao Martins <joao.m.martins@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Linux MM <linux-mm@kvack.org>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Matthew Wilcox <willy@infradead.org>,
Jason Gunthorpe <jgg@ziepe.ca>,
John Hubbard <jhubbard@nvidia.com>,
Jane Chu <jane.chu@oracle.com>,
Muchun Song <songmuchun@bytedance.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>,
Linux NVDIMM <nvdimm@lists.linux.dev>,
Linux Doc Mailing List <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
Date: Wed, 28 Jul 2021 21:08:40 +0100 [thread overview]
Message-ID: <eac81d19-1bbc-6441-b9b4-12a8de041053@oracle.com> (raw)
In-Reply-To: <CAPcyv4jC9He7tnTnbiracHZ9P9XSWsH4pJMKFip6-nSbsBWyrg@mail.gmail.com>
On 7/28/21 9:03 PM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Currently, for compound PUD mappings, the implementation consumes 40MB
>> per TB but it can be optimized to 16MB per TB with the approach
>> detailed below.
>>
>> Right now base pages are used to populate the PUD tail pages, and the
>> code picks the address of the previous page of the subsection that
>> precedes the memmap being initialized. This is done whenever a given
>> memmap address isn't aligned to the pgmap @geometry (which is safe
>> because @ranges are guaranteed to be aligned to @geometry).
>>
>> For pagemaps whose geometry spans multiple sections, this means that
>> PMD pages are unnecessarily allocated just to reuse the same tail
>> pages. Effectively, on x86 a PUD can span 8 sections (depending on
>> config), and a page is allocated for each PMD so that it can reuse the
>> tail vmemmap across the rest of the PTEs. In short, the tail vmemmap
>> areas covered by those PMDs all contain the same PFNs. So instead,
>> populate a new PMD on the second section of the compound page (the
>> tail vmemmap PMD), and have the following sections reuse that
>> previously populated PMD, which contains only tail pages.
>>
>> With this scheme, for a 1GB pagemap-aligned area, the first PMD
>> (section) contains the head page and 32767 tail pages, while the
>> second PMD contains the full 32768 tail pages. The latter's PMD gets
>> reused across future section mappings of the same pagemap.
>>
>> Besides allocating fewer page-table entries and keeping parity with
>> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
>> this further increases the savings per compound page. Rather than
>> requiring 8 PMD page allocations, we only need 2 (plus two base pages
>> allocated for the head and tail areas of the first PMD). 2M pages
>> still require base pages, though.
>
> This looks good to me now, modulo the tail_page helper discussed
> previously. Thanks for the diagram, makes it clearer what's happening.
>
> I don't see any red flags that would prevent a reviewed-by when you
> send the next spin.
>
Cool, thanks!
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>> include/linux/mm.h | 3 +-
>> mm/sparse-vmemmap.c | 74 +++++++++++++++++---
>> 3 files changed, 174 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
>> index 42830a667c2a..96d9f5f0a497 100644
>> --- a/Documentation/vm/vmemmap_dedup.rst
>> +++ b/Documentation/vm/vmemmap_dedup.rst
>> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>> It only use 3 page structs for storing all information as opposed
>> to 4 on HugeTLB pages. This does not affect memory savings between both.
>>
>> +Additionally, the tail page deduplication is further extended to 1GB
>> +device-dax compound pages.
>> +
>> +E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
>> +across 8 PMD page frames, with the first PMD having 2 PTE page frames.
>> +In total this represents 40960 bytes per 1GB page.
>> +
>> +Here is how things look after the previously described tail page deduplication
>> +technique.
>> +
>> + device-dax page frames struct pages(4096 pages) page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+ mapping to +-------------+
>> + | | | 0 | | 0 | -------------> | 0 |
>> + | | +----------+ +-----------+ +-------------+
>> + | | | 1 | -------------> | 1 |
>> + | | +-----------+ +-------------+
>> + | | | 2 | ----------------^ ^ ^ ^ ^ ^ ^
>> + | | +-----------+ | | | | | |
>> + | | | 3 | ------------------+ | | | | |
>> + | | +-----------+ | | | | |
>> + | | | 4 | --------------------+ | | | |
>> + | PMD 0 | +-----------+ | | | |
>> + | | | 5 | ----------------------+ | | |
>> + | | +-----------+ | | |
>> + | | | .. | ------------------------+ | |
>> + | | +-----------+ | |
>> + | | | 511 | --------------------------+ |
>> + | | +-----------+ |
>> + | | |
>> + | | |
>> + | | |
>> + +-----------+ page frames |
>> + +-----------+ -> +----------+ --> +-----------+ mapping to |
>> + | | | 1 .. 7 | | 512 | ----------------------------+
>> + | | +----------+ +-----------+ |
>> + | | | .. | ----------------------------+
>> + | | +-----------+ |
>> + | | | .. | ----------------------------+
>> + | | +-----------+ |
>> + | | | .. | ----------------------------+
>> + | | +-----------+ |
>> + | | | .. | ----------------------------+
>> + | PMD | +-----------+ |
>> + | 1 .. 7 | | .. | ----------------------------+
>> + | | +-----------+ |
>> + | | | .. | ----------------------------+
>> + | | +-----------+ |
>> + | | | 4095 | ----------------------------+
>> + +-----------+ +-----------+
>> +
>> +Page frames of PMDs 1 through 7 are allocated and mapped to the same PTE
>> +page frame that stores tail pages. As we can see in the diagram, PMDs 1
>> +through 7 all look the same. Therefore we can map PMDs 2 through 7 to the
>> +PMD 1 page frame. This allows freeing 6 vmemmap pages per 1GB page,
>> +decreasing the overhead per 1GB page from 40960 bytes to 16384 bytes.
>> +
>> +Here is how things look after PMD tail page deduplication.
>> +
>> + device-dax page frames struct pages(4096 pages) page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+ mapping to +-------------+
>> + | | | 0 | | 0 | -------------> | 0 |
>> + | | +----------+ +-----------+ +-------------+
>> + | | | 1 | -------------> | 1 |
>> + | | +-----------+ +-------------+
>> + | | | 2 | ----------------^ ^ ^ ^ ^ ^ ^
>> + | | +-----------+ | | | | | |
>> + | | | 3 | ------------------+ | | | | |
>> + | | +-----------+ | | | | |
>> + | | | 4 | --------------------+ | | | |
>> + | PMD 0 | +-----------+ | | | |
>> + | | | 5 | ----------------------+ | | |
>> + | | +-----------+ | | |
>> + | | | .. | ------------------------+ | |
>> + | | +-----------+ | |
>> + | | | 511 | --------------------------+ |
>> + | | +-----------+ |
>> + | | |
>> + | | |
>> + | | |
>> + +-----------+ page frames |
>> + +-----------+ -> +----------+ --> +-----------+ mapping to |
>> + | | | 1 | | 512 | ----------------------------+
>> + | | +----------+ +-----------+ |
>> + | | ^ ^ ^ ^ ^ ^ | .. | ----------------------------+
>> + | | | | | | | | +-----------+ |
>> + | | | | | | | | | .. | ----------------------------+
>> + | | | | | | | | +-----------+ |
>> + | | | | | | | | | .. | ----------------------------+
>> + | | | | | | | | +-----------+ |
>> + | | | | | | | | | .. | ----------------------------+
>> + | PMD 1 | | | | | | | +-----------+ |
>> + | | | | | | | | | .. | ----------------------------+
>> + | | | | | | | | +-----------+ |
>> + | | | | | | | | | .. | ----------------------------+
>> + | | | | | | | | +-----------+ |
>> + | | | | | | | | | 4095 | ----------------------------+
>> + +-----------+ | | | | | | +-----------+
>> + | PMD 2 | ----+ | | | | |
>> + +-----------+ | | | | |
>> + | PMD 3 | ------+ | | | |
>> + +-----------+ | | | |
>> + | PMD 4 | --------+ | | |
>> + +-----------+ | | |
>> + | PMD 5 | ----------+ | |
>> + +-----------+ | |
>> + | PMD 6 | ------------+ |
>> + +-----------+ |
>> + | PMD 7 | --------------+
>> + +-----------+
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5e3e153ddd3d..e9dc3e2de7be 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>> pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>> p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>> pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> + struct page *block);
>> pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> struct vmem_altmap *altmap, struct page *block);
>> void *vmemmap_alloc_block(unsigned long size, int node);
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index a8de6c472999..68041ca9a797 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>> return p;
>> }
>>
>> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> + struct page *block)
>> {
>> pmd_t *pmd = pmd_offset(pud, addr);
>> if (pmd_none(*pmd)) {
>> - void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> - if (!p)
>> - return NULL;
>> + void *p;
>> +
>> + if (!block) {
>> + p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> + if (!p)
>> + return NULL;
>> + } else {
>> + /* See comment in vmemmap_pte_populate(). */
>> + get_page(block);
>> + p = page_to_virt(block);
>> + }
>> pmd_populate_kernel(&init_mm, pmd, p);
>> }
>> return pmd;
>> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>> return pgd;
>> }
>>
>> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> - struct vmem_altmap *altmap,
>> - struct page *reuse, struct page **page)
>> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
>> + struct vmem_altmap *altmap,
>> + struct page *reuse, pmd_t **ptr)
>> {
>> pgd_t *pgd;
>> p4d_t *p4d;
>> pud_t *pud;
>> pmd_t *pmd;
>> - pte_t *pte;
>>
>> pgd = vmemmap_pgd_populate(addr, node);
>> if (!pgd)
>> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> pud = vmemmap_pud_populate(p4d, addr, node);
>> if (!pud)
>> return -ENOMEM;
>> - pmd = vmemmap_pmd_populate(pud, addr, node);
>> + pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>> if (!pmd)
>> return -ENOMEM;
>> + if (ptr)
>> + *ptr = pmd;
>> + return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> + struct vmem_altmap *altmap,
>> + struct page *reuse, struct page **page)
>> +{
>> + pmd_t *pmd;
>> + pte_t *pte;
>> +
>> + if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
>> + return -ENOMEM;
>> +
>> pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>> if (!pte)
>> return -ENOMEM;
>> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> return vmemmap_populate_address(addr, node, NULL, NULL, page);
>> }
>>
>> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
>> + unsigned long end,
>> + int node, struct page *page)
>> +{
>> + unsigned long addr = start;
>> +
>> + for (; addr < end; addr += PMD_SIZE) {
>> + if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
>> + return -ENOMEM;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> unsigned long start,
>> unsigned long end, int node,
>> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>> pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
>> + pmd_t *pmdp;
>> pte_t *ptep;
>>
>> addr = start - PAGE_SIZE;
>> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> * the previous struct pages are mapped when trying to lookup
>> * the last tail page.
>> */
>> - ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> - if (!ptep)
>> + pmdp = pmd_off_k(addr);
>> + if (!pmdp)
>> + return -ENOMEM;
>> +
>> + /*
>> + * Reuse the tail pages vmemmap pmd page
>> + * See layout diagram in Documentation/vm/vmemmap_dedup.rst
>> + */
>> + if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
>> + return vmemmap_populate_pmd_range(start, end, node,
>> + pmd_page(*pmdp));
>> +
>> + /* See comment above when pmd_off_k() is called. */
>> + ptep = pte_offset_kernel(pmdp, addr);
>> + if (pte_none(*ptep))
>> return -ENOMEM;
>>
>> /*
>> + * Populate the tail pages vmemmap pmd page.
>> * Reuse the page that was populated in the prior iteration
>> * with just tail struct pages.
>> */
>> --
>> 2.17.1
>>