From: Joao Martins <joao.m.martins@oracle.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Linux MM <linux-mm@kvack.org>,
linux-nvdimm <linux-nvdimm@lists.01.org>,
Matthew Wilcox <willy@infradead.org>,
Jason Gunthorpe <jgg@ziepe.ca>, Jane Chu <jane.chu@oracle.com>,
Muchun Song <songmuchun@bytedance.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
Date: Thu, 6 May 2021 12:01:47 +0100
Message-ID: <63ce653d-4d21-c955-7b81-abba25b9e4b6@oracle.com>
In-Reply-To: <CAPcyv4hcMg6cJctnY6V=ngL9GoPVvd3sSYRbS-WZoTg=SnrhEw@mail.gmail.com>
On 5/6/21 2:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>> means that pages are mapped at a given huge page alignment and utilize
>> uses compound pages as opposed to order-0 pages.
>>
>> To minimize struct page overhead we take advantage of the fact that
>
> I'm not sure if Andrew is a member of the "we" police, but I am on
> team "imperative tense".
>
That was my mistake once again; you did mention it over the RFC.
A few days after submitting the series I noticed[0] that this patch and the
next one still had a few instances of misplaced 'we', in addition to a small
unit mistake I made in the cover letter.
[0] https://lore.kernel.org/linux-mm/a2b7dc3a-3f1d-b8e1-4770-e261055f2b17@oracle.com/
> "Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead."
>
> ...I think all the other "we"s below can be dropped without losing any meaning.
>
Indeed.
>> most tail pages look the same (except the first two). We allocate a
>> separate page for the vmemmap area which contains the head page and
>> separate for the next 64 pages. The rest of the subsections then reuse
>> this tail vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>> PUD) it will cross various sections. To be able to reuse tail pages
>> across sections belonging to the same gigantic page we fetch the
>> @range being mapped (nr_ranges + 1). If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that preceeds it and use that to populate the entire
>
> s/preceeds/precedes/
>
/me nods
>> section.
>>
>> On compound pagemaps with 2M align, this lets mechanism saves 6 pages
>
> s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/
>
Will fix.
>> out of the 8 necessary PFNs necessary to set the subsection's 512
>> struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus fallback to the already in use vmemmap_populate().
>
> Perhaps also note that altmap for devmap mappings was there to relieve
> the pressure of inordinate amounts of memmap space to map terabytes of
> PMEM. With compound pages the motivation for altmaps for PMEM is
> reduced.
>
OK, I suppose it's worth highlighting.
But just how 'reduced' that motivation is, is still not 100% clear to me.
As we discussed over the RFC, with the DEVMAP_PMD geometry we still take a
non-trivial hit of ~4GB per TB. And altmap is a rather quick way of making
heavily disproportionate RAM-to-PMEM ratios work without paying the RAM
penalty, especially with the pinning numbers so heavily reduced.
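To put rough numbers on it (my own back-of-the-envelope, assuming 4K base
pages and a 64-byte struct page):

	no reuse:     1T / 4K * 64 bytes             = 16G of vmemmap per 1T
	2M geometry:  1T / 2M * 2 vmemmap pages * 4K = ~4G of vmemmap per 1T
	              (one head page plus one shared tail page per 2M)

So compound pages take the RAM cost from 16G down to ~4G per terabyte of
PMEM, and altmap is what would remove that remaining ~4G from RAM entirely.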
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> include/linux/mm.h | 2 +-
>> mm/memremap.c | 1 +
>> mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>> 3 files changed, 131 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 61474602c2b1..49d717ae40ae 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>> pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> - struct vmem_altmap *altmap);
>> + struct vmem_altmap *altmap, void *block);
>> void *vmemmap_alloc_block(unsigned long size, int node);
>> struct vmem_altmap;
>> void *vmemmap_alloc_block_buf(unsigned long size, int node,
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index d160853670c4..2e6bc0b1ff00 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>> {
>> struct mhp_params params = {
>> .altmap = pgmap_altmap(pgmap),
>> + .pgmap = pgmap,
>> .pgprot = PAGE_KERNEL,
>> };
>> const int nr_range = pgmap->nr_range;
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 8814876edcfa..f57c5eada099 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>> }
>>
>> pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> - struct vmem_altmap *altmap)
>> + struct vmem_altmap *altmap, void *block)
>
> For type-safety I would make this argument a 'struct page *' and drop
> the virt_to_page().
>
Will do.
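Something like this, then (just a sketch; the 'reuse' parameter name is my
own, not settled):

	pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr,
					       int node,
					       struct vmem_altmap *altmap,
					       struct page *reuse);

...with callers passing pte_page(*ptep) directly, so both the
virt_to_page() here and the page_to_virt() at the call site go away.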
>> {
>> pte_t *pte = pte_offset_kernel(pmd, addr);
>> if (pte_none(*pte)) {
>> pte_t entry;
>> - void *p;
>> -
>> - p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> - if (!p)
>> - return NULL;
>> + void *p = block;
>> +
>> + if (!block) {
>> + p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> + if (!p)
>> + return NULL;
>> + } else if (!altmap) {
>> + get_page(virt_to_page(block));
>
> This is either super subtle, or there is a missing put_page(). I'm
> assuming that in the shutdown path the same page will be freed
> multiple times because the same page is mapped multiple times.
>
It's the former (subtlety):

1) When a PTE/PMD entry is freed from the init_mm, there's a free_pages()
call on the page allocated above. free_pages() does a put_page_testzero()
before actually freeing the page, so each get_page() taken here is dropped
by one of those calls.

2) Given this codepath is solely used by pgmap infra, the page allocator is
already available (i.e. slab_is_available()), hence we always go down the
alloc_pages_node() codepath.
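As a pseudo-trace of one reused tail page (my reading of the mm internals,
not verified line by line):

	p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL);
						/* refcount == 1 */
	get_page(virt_to_page(p));		/* once per extra PTE: ++ */
	...
	/* teardown: every PTE that mapped it reaches free_pages(), but
	 * put_page_testzero() only hits zero on the last call, which is
	 * the only one that actually frees the page */

So N mappings are balanced by N free_pages() calls on teardown, and only
the final one releases the page.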
> Have you validated that the accounting is correct and all allocated
> pages get freed? I feel this needs more than a comment, it needs a
> test to validate that the accounting continues to happen correctly as
> future vmemmap changes land.
>
I can add a comment *at least*.
I checked the accounting, but let me be extra pedantic and recheck it once
more, as I settled this part some time ago. I will follow up on this chunk
should it turn out to be broken in some way.
>> + }
>> entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>> set_pte_at(&init_mm, addr, pte, entry);
>> }
>> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>> }
>>
>> static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> - struct vmem_altmap *altmap)
>> + struct vmem_altmap *altmap,
>> + void *page, void **ptr)
>> {
>> pgd_t *pgd;
>> p4d_t *p4d;
>> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> pmd = vmemmap_pmd_populate(pud, addr, node);
>> if (!pmd)
>> return -ENOMEM;
>> - pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> + pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>> if (!pte)
>> return -ENOMEM;
>> vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>> +
>> + if (ptr)
>> + *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
>> + return 0;
>> }
>>
>> int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> unsigned long addr = start;
>>
>> for (; addr < end; addr += PAGE_SIZE) {
>> - if (vmemmap_populate_address(addr, node, altmap))
>> + if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
>> + return -ENOMEM;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> + unsigned long end,
>> + int node, void *page)
>> +{
>> + unsigned long addr = start;
>> +
>> + for (; addr < end; addr += PAGE_SIZE) {
>> + if (vmemmap_populate_address(addr, node, NULL, page, NULL))
>> + return -ENOMEM;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> + void **ptr)
>> +{
>> + return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>> +}
>> +
>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>
> I think this can be replaced with a call to follow_pte(&init_mm...).
>
Ah, of course! That would shorten things up too.
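Something along these lines, then (untested, and assuming follow_pte() can
be pointed at init_mm):

	pte_t *ptep;
	spinlock_t *ptl;
	struct page *page;

	if (follow_pte(&init_mm, start - PAGE_SIZE, &ptep, &ptl))
		return -ENOMEM;
	page = pte_page(*ptep);	/* entry is stable; unlock right away */
	pte_unmap_unlock(ptep, ptl);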
>> +{
>> + pgd_t *pgd;
>> + p4d_t *p4d;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + pte_t *pte;
>> +
>> + pgd = pgd_offset_k(addr);
>> + if (pgd_none(*pgd))
>> + return NULL;
>> +
>> + p4d = p4d_offset(pgd, addr);
>> + if (p4d_none(*p4d))
>> + return NULL;
>> +
>> + pud = pud_offset(p4d, addr);
>> + if (pud_none(*pud))
>> + return NULL;
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (pmd_none(*pmd))
>> + return NULL;
>> +
>> + pte = pte_offset_kernel(pmd, addr);
>> + if (pte_none(*pte))
>> + return NULL;
>> +
>> + return pte;
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> + unsigned long start,
>> + unsigned long end, int node,
>> + struct dev_pagemap *pgmap)
>> +{
>> + unsigned long offset, size, addr;
>> +
>> + /*
>> + * For compound pages bigger than section size (e.g. 1G) fill the rest
>
> Oh, I had to read this a couple times because I thought the "e.g." was
> referring to section size. How about:
>
> s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/
>
Will fix, but instead to:

	e.g. x86 1G compound pages with 128M section size

...because vmemmap_populate_compound_pages() is called per section, while
being subsection-aligned.
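For reference, the x86-64 numbers I have in mind (assuming the default
128M sections):

	SECTION_SIZE	128M	one vmemmap_populate_compound_pages() call
				covers (a subsection-aligned piece of) one
				section
	SUBSECTION_SIZE	2M	so start/end may be 2M-aligned fragments
	1G compound page	spans 8 sections; the 7 that aren't at
				offset 0 of the 1G page look up the PTE of
				the preceding struct page to reuse its
				tail vmemmap page

Hence the IS_ALIGNED(offset, ...) check against the range start below.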
>> + * of sections as tail pages.
>> + *
>> + * Note that memremap_pages() resets @nr_range value and will increment
>> + * it after each range successful onlining. Thus the value or @nr_range
>> + * at section memmap populate corresponds to the in-progress range
>> + * being onlined that we care about.
>> + */
>> + offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> + if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>> + pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> + pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +
>> + if (!ptep)
>> + return -ENOMEM;
>
> Side note: I had been resistant to adding 'struct page' for altmap
> pages, but that would be a requirement to enable compound pages +
> altmap.
>
Fair enough.
I was thinking about something a little simpler, like adding a refcount to
the altmap pfn allocator, to avoid arch changes and be a little tidier on
space (16 bytes rather than 64 bytes). But I suspect I can avoid the said
arch changes even with a backing struct page.
>> +
>
> Perhaps a comment here to the effect "recall the page that was
> allocated in a prior iteration and fill it in with tail pages".
>
Will add.
>> + return vmemmap_populate_range(start, end, node,
>> + page_to_virt(pte_page(*ptep)));
>
> If the @block parameter changes to a 'struct page *' it also saves
> some typing here.
>
Indeed.
>> + }
>> +
>> + size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
>> + for (addr = start; addr < end; addr += size) {
>> + unsigned long next = addr, last = addr + size;
>> + void *block;
>> +
>> + /* Populate the head page vmemmap page */
>> + if (vmemmap_populate_page(addr, node, NULL))
>> + return -ENOMEM;
>> +
>> + /* Populate the tail pages vmemmap page */
>> + block = NULL;
>> + next = addr + PAGE_SIZE;
>> + if (vmemmap_populate_page(next, node, &block))
>> + return -ENOMEM;
>> +
>> + /* Reuse the previous page for the rest of tail pages */
>> + next += PAGE_SIZE;
>> + if (vmemmap_populate_range(next, last, node, block))
>> return -ENOMEM;
>> }
>>
>> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>> {
>> unsigned long start = (unsigned long) pfn_to_page(pfn);
>> unsigned long end = start + nr_pages * sizeof(struct page);
>> + unsigned int align = pgmap_align(pgmap);
>> + int r;
>>
>> if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>> !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>> return NULL;
>>
>> - if (vmemmap_populate(start, end, nid, altmap))
>> + if (align > PAGE_SIZE && !altmap)
>
> I also think this code will read better the devmap_geometry enum.
>
True.
I am starting to like the ring of @geometry rather than @align for
representing how the metadata is laid out. It should avoid confusion
between device-dax @align and pagemap @align.
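Along these lines, say (strawman only; DEVMAP_PMD came up earlier in the
thread, the other names are my guess):

	enum devmap_geometry {
		DEVMAP_PTE,	/* order-0 pages, today's behaviour */
		DEVMAP_PMD,	/* 2M compound pages (x86) */
		DEVMAP_PUD,	/* 1G compound pages (x86) */
	};

...turning the check above into something like
'if (pgmap->geometry != DEVMAP_PTE && !altmap)'.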
>> + r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> + else
>> + r = vmemmap_populate(start, end, nid, altmap);
>> +
>> + if (r < 0)
>> return NULL;
>>
>> return pfn_to_page(pfn);
>> --
>> 2.17.1
>>