nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Vishal Verma <vishal.l.verma@intel.com>,
	 Dave Jiang <dave.jiang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	 Matthew Wilcox <willy@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>,
	 Jane Chu <jane.chu@oracle.com>,
	Muchun Song <songmuchun@bytedance.com>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Jonathan Corbet <corbet@lwn.net>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	 Linux Doc Mailing List <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
Date: Wed, 28 Jul 2021 13:03:35 -0700	[thread overview]
Message-ID: <CAPcyv4jC9He7tnTnbiracHZ9P9XSWsH4pJMKFip6-nSbsBWyrg@mail.gmail.com> (raw)
In-Reply-To: <20210714193542.21857-15-joao.m.martins@oracle.com>

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Currently, for compound PUD mappings, the implementation consumes 40MB
> per TB but it can be optimized to 16MB per TB with the approach
> detailed below.
>
> Right now basepages are used to populate the PUD tail pages, and it
> picks the address of the previous page of the subsection that precedes
> the memmap being initialized.  This is done when a given memmap
> address isn't aligned to the pgmap @geometry (which is safe to do because
> @ranges are guaranteed to be aligned to @geometry).
>
> For pagemaps with an align which spans various sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is being  allocated a page for the PMD to reuse
> the tail vmemmap across the rest of the PTEs. In short effecitvely the
> PMD cover the tail vmemmap areas all contain the same PFN. So instead
> of doing this way, populate a new PMD on the second section of the
> compound page (tail vmemmap PMD), and then the following sections
> utilize the preceding PMD previously populated which only contain
> tail pages).
>
> After this scheme for an 1GB pagemap aligned area, the first PMD
> (section) would contain head page and 32767 tail pages, where the
> second PMD contains the full 32768 tail pages.  The latter page gets
> its PMD reused across future section mapping of the same pagemap.
>
> Besides fewer pagetable entries allocated, keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases savings per compound page. Rather than
> requiring 8 PMD page allocations only need 2 (plus two base pages
> allocated for head and tail areas for the first PMD). 2M pages still
> require using base pages, though.

This looks good to me now, modulo the tail_page helper discussed
previously. Thanks for the diagram, makes it clearer what's happening.

I don't see any red flags that would prevent a reviewed-by when you
send the next spin.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>  include/linux/mm.h                 |   3 +-
>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>  3 files changed, 174 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
> index 42830a667c2a..96d9f5f0a497 100644
> --- a/Documentation/vm/vmemmap_dedup.rst
> +++ b/Documentation/vm/vmemmap_dedup.rst
> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>  It only use 3 page structs for storing all information as opposed
>  to 4 on HugeTLB pages. This does not affect memory savings between both.
>
> +Additionally, it further extends the tail page deduplication with 1GB
> +device-dax compound pages.
> +
> +E.g.: A 1G device-dax page on x86_64 consists in 4096 page frames, split
> +across 8 PMD page frames, with the first PMD having 2 PTE page frames.
> +In total this represents a total of 40960 bytes per 1GB page.
> +
> +Here is how things look after the previously described tail page deduplication
> +technique.
> +
> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                     |     1     | -------------> |      1      |
> + |           |                     +-----------+                +-------------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | | |
> + |           |                     |     3     | ------------------+ | | | | |
> + |           |                     +-----------+                     | | | | |
> + |           |                     |     4     | --------------------+ | | | |
> + |   PMD 0   |                     +-----------+                       | | | |
> + |           |                     |     5     | ----------------------+ | | |
> + |           |                     +-----------+                         | | |
> + |           |                     |     ..    | ------------------------+ | |
> + |           |                     +-----------+                           | |
> + |           |                     |     511   | --------------------------+ |
> + |           |                     +-----------+                             |
> + |           |                                                               |
> + |           |                                                               |
> + |           |                                                               |
> + +-----------+     page frames                                               |
> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
> + |           |    |  1 .. 7  |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |    PMD    |                     +-----------+                             |
> + |  1 .. 7   |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    4095   | ----------------------------+
> + +-----------+                     +-----------+
> +
> +Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page frame
> +that contains stores tail pages. As we can see in the diagram, PMDs 1 through 7
> +all look like the same. Therefore we can map PMD 2 through 7 to PMD 1 page frame.
> +This allows to free 6 vmemmap pages per 1GB page, decreasing the overhead per
> +1GB page from 40960 bytes to 16384 bytes.
> +
> +Here is how things look after PMD tail page deduplication.
> +
> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                     |     1     | -------------> |      1      |
> + |           |                     +-----------+                +-------------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | | |
> + |           |                     |     3     | ------------------+ | | | | |
> + |           |                     +-----------+                     | | | | |
> + |           |                     |     4     | --------------------+ | | | |
> + |   PMD 0   |                     +-----------+                       | | | |
> + |           |                     |     5     | ----------------------+ | | |
> + |           |                     +-----------+                         | | |
> + |           |                     |     ..    | ------------------------+ | |
> + |           |                     +-----------+                           | |
> + |           |                     |     511   | --------------------------+ |
> + |           |                     +-----------+                             |
> + |           |                                                               |
> + |           |                                                               |
> + |           |                                                               |
> + +-----------+     page frames                                               |
> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
> + |           |    |    1     |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |   PMD 1   |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    4095   | ----------------------------+
> + +-----------+     | | | | | |     +-----------+
> + |   PMD 2   | ----+ | | | | |
> + +-----------+       | | | | |
> + |   PMD 3   | ------+ | | | |
> + +-----------+         | | | |
> + |   PMD 4   | --------+ | | |
> + +-----------+           | | |
> + |   PMD 5   | ----------+ | |
> + +-----------+             | |
> + |   PMD 6   | ------------+ |
> + +-----------+               |
> + |   PMD 7   | --------------+
> + +-----------+
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e3e153ddd3d..e9dc3e2de7be 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                           struct page *block);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>                             struct vmem_altmap *altmap, struct page *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a8de6c472999..68041ca9a797 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>         return p;
>  }
>
> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                                      struct page *block)
>  {
>         pmd_t *pmd = pmd_offset(pud, addr);
>         if (pmd_none(*pmd)) {
> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> -               if (!p)
> -                       return NULL;
> +               void *p;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       /* See comment in vmemmap_pte_populate(). */
> +                       get_page(block);
> +                       p = page_to_virt(block);
> +               }
>                 pmd_populate_kernel(&init_mm, pmd, p);
>         }
>         return pmd;
> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap,
> -                                             struct page *reuse, struct page **page)
> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
> +                                                 struct vmem_altmap *altmap,
> +                                                 struct page *reuse, pmd_t **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = vmemmap_pgd_populate(addr, node);
>         if (!pgd)
> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pud = vmemmap_pud_populate(p4d, addr, node);
>         if (!pud)
>                 return -ENOMEM;
> -       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>         if (!pmd)
>                 return -ENOMEM;
> +       if (ptr)
> +               *ptr = pmd;
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap,
> +                                             struct page *reuse, struct page **page)
> +{
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
> +               return -ENOMEM;
> +
>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>         if (!pte)
>                 return -ENOMEM;
> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, page);
>  }
>
> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
> +                                               unsigned long end,
> +                                               int node, struct page *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PMD_SIZE) {
> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                      unsigned long start,
>                                                      unsigned long end, int node,
> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>             pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> +               pmd_t *pmdp;
>                 pte_t *ptep;
>
>                 addr = start - PAGE_SIZE;
> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                  * the previous struct pages are mapped when trying to lookup
>                  * the last tail page.
>                  */
> -               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> -               if (!ptep)
> +               pmdp = pmd_off_k(addr);
> +               if (!pmdp)
> +                       return -ENOMEM;
> +
> +               /*
> +                * Reuse the tail pages vmemmap pmd page
> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                */
> +               if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
> +                       return vmemmap_populate_pmd_range(start, end, node,
> +                                                         pmd_page(*pmdp));
> +
> +               /* See comment above when pmd_off_k() is called. */
> +               ptep = pte_offset_kernel(pmdp, addr);
> +               if (pte_none(*ptep))
>                         return -ENOMEM;
>
>                 /*
> +                * Populate the tail pages vmemmap pmd page.
>                  * Reuse the page that was populated in the prior iteration
>                  * with just tail struct pages.
>                  */
> --
> 2.17.1
>

  reply	other threads:[~2021-07-28 20:03 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-07-15  0:17   ` Dan Williams
2021-07-15  2:51   ` [External] " Muchun Song
2021-07-15  6:40     ` Christoph Hellwig
2021-07-15  9:19       ` Muchun Song
2021-07-15 13:17     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-07-15  0:19   ` Dan Williams
2021-07-15  2:53   ` [External] " Muchun Song
2021-07-15 13:17     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-07-15  0:20   ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-07-15  1:08   ` Dan Williams
2021-07-15 12:52     ` Joao Martins
2021-07-15 13:06       ` Joao Martins
2021-07-15 19:48       ` Dan Williams
2021-07-30 16:13         ` Joao Martins
2021-07-22  0:38       ` Jane Chu
2021-07-22 10:56         ` Joao Martins
2021-07-15 12:59     ` Christoph Hellwig
2021-07-15 13:15       ` Joao Martins
2021-07-15  6:48   ` Christoph Hellwig
2021-07-15 13:15     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-07-28  5:56   ` Dan Williams
2021-07-28  9:43     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-07-28  6:04   ` Dan Williams
2021-07-28 10:48     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-07-15  2:47   ` [External] " Muchun Song
2021-07-15 13:16     ` Joao Martins
2021-07-28  6:09   ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-07-28  6:55   ` Dan Williams
2021-07-28 15:35     ` Joao Martins
2021-07-28 18:03       ` Dan Williams
2021-07-28 18:54         ` Joao Martins
2021-07-28 20:04           ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-07-28  7:28   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-07-28 16:08       ` Dan Williams
2021-07-28 16:12         ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-07-28  7:29   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-07-28  7:30   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-08-06 12:28       ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 12/14] device-dax: compound pagemap support Joao Martins
2021-07-14 23:36   ` Dan Williams
2021-07-15 12:00     ` Joao Martins
2021-07-27 23:51       ` Dan Williams
2021-07-28  9:36         ` Joao Martins
2021-07-28 18:51           ` Dan Williams
2021-07-28 18:59             ` Joao Martins
2021-07-28 19:03               ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-07-28 19:55   ` Dan Williams
2021-07-28 20:07     ` Joao Martins
2021-07-28 20:23       ` Dan Williams
2021-08-25 19:10         ` Joao Martins
2021-08-25 19:15           ` Matthew Wilcox
2021-08-25 19:26             ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
2021-07-28 20:03   ` Dan Williams [this message]
2021-07-28 20:08     ` Joao Martins
2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
2021-07-14 23:47   ` Dan Williams
2021-07-22  2:24   ` Matthew Wilcox
2021-07-22 10:53     ` Joao Martins
2021-07-27 23:23       ` Dan Williams
2021-08-02 10:40         ` Joao Martins
2021-08-02 14:06           ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4jC9He7tnTnbiracHZ9P9XSWsH4pJMKFip6-nSbsBWyrg@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=dave.jiang@intel.com \
    --cc=jane.chu@oracle.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=joao.m.martins@oracle.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=songmuchun@bytedance.com \
    --cc=vishal.l.verma@intel.com \
    --cc=willy@infradead.org \
    --subject='Re: [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
on how to clone and mirror all data and code used for this inbox