From: Roman Gushchin <guro@fb.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>, <linux-mm@kvack.org>,
Rik van Riel <riel@surriel.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Matthew Wilcox <willy@infradead.org>,
Shakeel Butt <shakeelb@google.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
David Nellans <dnellans@nvidia.com>,
<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64
Date: Thu, 3 Sep 2020 09:25:27 -0700 [thread overview]
Message-ID: <20200903162527.GF60440@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <20200903073254.GP4617@dhcp22.suse.cz>
On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> >
> > Hi all,
> >
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> >
> > Compared to hugetlb, 1GB THP is a more flexible way to reduce translation
> > overhead and increase the performance of applications with large memory
> > footprints, without requiring application changes.
>
> Please be more specific about use cases. They had better be strong ones,
> because the THP code is already complex enough without adding to it solely
> for generic TLB pressure relief.
Hello, Michal!
We at Facebook are using 1GB hugetlbfs pages and are seeing noticeable
performance wins on some workloads.
Historically we allocated gigantic pages at boot time, but we recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
management than we would like. 1GB THP seems to be a better alternative, so I
definitely see it as a very useful feature.
Given the cost of an allocation, I'm slightly skeptical about an automatic
heuristics-based approach, but if an application can explicitly mark target areas
with madvise(), I don't see why it wouldn't work.
In our case we'd like a reliable way to get 1GB THPs at some point (usually
at application start), and to have them transparently destroyed on application
exit.
Once the patchset is in relatively good shape, I'll be happy to give it a test
in our environment and share the results.
Thanks!
>
> > Design
> > =======
> >
> > The 1GB THP implementation looks similar to the existing THP code, except for
> > some new designs for the additional page table level.
> >
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> > instead of one PTE page table page, a 1GB THP requires 513 page table pages
> > (one PMD page table page and 512 PTE page table pages) to be deposited
> > at page allocation time, so that we can split the page later. Currently,
> > the page table deposit uses ->lru, so only one page can be deposited.
> > A new pagechain data structure is added to enable multi-page deposits.
> >
> > 2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
> > and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
> > PageDoubleMap mechanism. To add PMD mappings, PMDPageInPUD and
> > sub_compound_mapcount are introduced. A PMDPageInPUD is a 512-aligned base
> > page in a 1GB THP, and sub_compound_mapcount counts PMD mappings using
> > page[N*512 + 3].compound_mapcount.
> >
> > 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is saner
> > to use something less intrusive, so all 1GB THPs are allocated from reserved
> > CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> > THP is cleared so the resulting pages can be freed via the normal page free path.
> > We can fall back to alloc_contig_pages() for 1GB THP if necessary.
>
> Do those pages get instantiated during the page fault or only via
> khugepaged? This is an important design detail because then we have to
> think carefully about how much automatic we want this to be. Memory
> overhead can be quite large with 2MB THPs already. Also what about the
> allocation overhead? Do you have any numbers?
>
> Maybe all these details are described in the patcheset but the cover
> letter should contain all that information. It doesn't make much sense
> to dig into details in a patchset this large without having an idea how
> feasible this is.
>
> Thanks.
>
> > Patch Organization
> > =======
> >
> > Patch 01 adds the new pagechain data structure.
> >
> > Patches 02 to 13 add 1GB THP support in various places.
> >
> > Patch 14 tries to use alloc_contig_pages() for 1GB THP allocation.
> >
> > Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
> >
> > Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.
> >
> >
> > Any suggestions and comments are welcome.
> >
> >
> > Zi Yan (16):
> > mm: add pagechain container for storing multiple pages.
> > mm: thp: 1GB anonymous page implementation.
> > mm: proc: add 1GB THP kpageflag.
> > mm: thp: 1GB THP copy on write implementation.
> > mm: thp: handling 1GB THP reference bit.
> > mm: thp: add 1GB THP split_huge_pud_page() function.
> > mm: stats: make smap stats understand PUD THPs.
> > mm: page_vma_walk: teach it about PMD-mapped PUD THP.
> > mm: thp: 1GB THP support in try_to_unmap().
> > mm: thp: split 1GB THPs at page reclaim.
> > mm: thp: 1GB THP follow_p*d_page() support.
> > mm: support 1GB THP pagemap support.
> > mm: thp: add a knob to enable/disable 1GB THPs.
> > mm: page_alloc: >=MAX_ORDER pages allocation and deallocation.
> > hugetlb: cma: move cma reserve function to cma.c.
> > mm: thp: use cma reservation for pud thp allocation.
> >
> > .../admin-guide/kernel-parameters.txt | 2 +-
> > arch/arm64/mm/hugetlbpage.c | 2 +-
> > arch/powerpc/mm/hugetlbpage.c | 2 +-
> > arch/x86/include/asm/pgalloc.h | 68 ++
> > arch/x86/include/asm/pgtable.h | 26 +
> > arch/x86/kernel/setup.c | 8 +-
> > arch/x86/mm/pgtable.c | 38 +
> > drivers/base/node.c | 3 +
> > fs/proc/meminfo.c | 2 +
> > fs/proc/page.c | 2 +
> > fs/proc/task_mmu.c | 122 ++-
> > include/linux/cma.h | 18 +
> > include/linux/huge_mm.h | 84 +-
> > include/linux/hugetlb.h | 12 -
> > include/linux/memcontrol.h | 5 +
> > include/linux/mm.h | 29 +-
> > include/linux/mm_types.h | 1 +
> > include/linux/mmu_notifier.h | 13 +
> > include/linux/mmzone.h | 1 +
> > include/linux/page-flags.h | 47 +
> > include/linux/pagechain.h | 73 ++
> > include/linux/pgtable.h | 34 +
> > include/linux/rmap.h | 10 +-
> > include/linux/swap.h | 2 +
> > include/linux/vm_event_item.h | 7 +
> > include/uapi/linux/kernel-page-flags.h | 2 +
> > kernel/events/uprobes.c | 4 +-
> > kernel/fork.c | 5 +
> > mm/cma.c | 119 +++
> > mm/gup.c | 60 +-
> > mm/huge_memory.c | 939 +++++++++++++++++-
> > mm/hugetlb.c | 114 +--
> > mm/internal.h | 2 +
> > mm/khugepaged.c | 6 +-
> > mm/ksm.c | 4 +-
> > mm/memcontrol.c | 13 +
> > mm/memory.c | 51 +-
> > mm/mempolicy.c | 21 +-
> > mm/migrate.c | 12 +-
> > mm/page_alloc.c | 57 +-
> > mm/page_vma_mapped.c | 129 ++-
> > mm/pgtable-generic.c | 56 ++
> > mm/rmap.c | 289 ++++--
> > mm/swap.c | 31 +
> > mm/swap_slots.c | 2 +
> > mm/swapfile.c | 8 +-
> > mm/userfaultfd.c | 2 +-
> > mm/util.c | 16 +-
> > mm/vmscan.c | 58 +-
> > mm/vmstat.c | 8 +
> > 50 files changed, 2270 insertions(+), 349 deletions(-)
> > create mode 100644 include/linux/pagechain.h
> >
> > --
> > 2.28.0
> >
>
> --
> Michal Hocko
> SUSE Labs