From: Michal Hocko <mhocko@suse.com>
To: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Matthew Wilcox <willy@infradead.org>,
	Shakeel Butt <shakeelb@google.com>,
	Yang Shi <shy828301@gmail.com>, Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	William Kucharski <william.kucharski@oracle.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
Date: Tue, 6 Oct 2020 13:59:27 +0200	[thread overview]
Message-ID: <20201006115927.GD29020@dhcp22.suse.cz> (raw)
In-Reply-To: <8850ABA0-0B42-41DB-9ADC-0E2BB1F841D0@nvidia.com>

On Mon 05-10-20 14:05:17, Zi Yan wrote:
> On 5 Oct 2020, at 13:39, David Hildenbrand wrote:
> 
> >>>> considering that 2MB THPs have turned out to be quite a pain, but the
> >>>> situation has settled over time. Maybe our current code base is better
> >>>> prepared for that.
> >>
> >> I am planning to refactor my code further to reduce the amount of
> >> added code, since PUD THP is very similar to PMD THP. One thing
> >> I want to achieve is to enable split_huge_page to split a page of any
> >> order into a group of pages of any lower order. A lot of code in this
> >> patchset replicates the behavior of PMD THP at the PUD level.
> >> It might be possible to deduplicate most of the code.
> >>
> >>>>
> >>>> Exposing that interface to userspace is a different story of course.
> >>>> I do agree that we likely do not want to be very explicit about that.
> >>>> E.g. an interface for address space defragmentation without any more
> >>>> specifics sounds like a useful feature to me. It will be up to the
> >>>> kernel to decide which huge pages to use.
> >>>
> >>> Yes, I think one important feature would be that we don't end up placing
> >>> a gigantic page where only a handful of pages are actually populated
> >>> without a green light from the application - because that's what some user
> >>> space applications care about (not consuming more memory than intended;
> >>> IIUC, this is also what this patch set does). I'm fine with placing
> >>> gigantic pages if it really just "defragments" the address space layout,
> >>> without filling unpopulated holes.
> >>>
> >>> Then, this would be mostly invisible to user space, and we really
> >>> wouldn't have to care about any configuration.
> >>
> >>
> >> I agree that the interface should require no configuration at all for
> >> most users. But I also wonder why we have hugetlbfs to allow users to
> >> specify different page sizes, which seems to run counter to the
> >> discussion above. Are we assuming advanced users should always use
> >> hugetlbfs instead of THPs?
> >
> > Well, with hugetlbfs you get real control over which page sizes to use.
> > No mixture, guarantees.
> >
> > In some environments you might want to control which application gets
> > which pagesize. I know of database applications and hypervisors that
> > sometimes really want 2MB huge pages instead of 1GB huge pages. And
> > sometimes you really want/need 1GB huge pages (e.g., low-latency
> > applications, real-time KVM, ...).
> >
> > Simple example: KVM with postcopy live migration
> >
> > While 2MB huge pages work reasonably well, migrating 1GB gigantic pages
> > on demand (via userfaultfd) is painfully slow / impractical.
> 
> 
> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If the kernel provided similar interfaces to control the
> page sizes of THPs, they would work the same as hugetlbfs. Mixing page
> sizes usually comes from system memory fragmentation; hugetlbfs avoids
> this mixture because of its special allocation pools, not because of
> the code itself.

Not really. Hugetlb is defined to provide consistent, single-page-size
access to memory, to the degree that you fail early if that cannot be
guaranteed. This is not an implementation detail; it is the semantic of
the feature. Control goes along with the interface.
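
To make the guarantee concrete, here is a minimal user space sketch
(mine, not something from the thread) of the hugetlb side: the
application names an explicit page size and either gets it or fails
outright, with no transparent fallback.

/*
 * Minimal sketch, not from the thread: asking for an explicit 1GB huge
 * page via MAP_HUGETLB.  The request is either satisfied with that page
 * size or mmap() fails early; there is no silent fallback to smaller
 * pages.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << MAP_HUGE_SHIFT)
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* exactly one 1GB huge page */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		       -1, 0);

	if (p == MAP_FAILED) {
		/* fail early: no 1GB page available in the reserved pool */
		perror("mmap");
		return 1;
	}
	munmap(p, len);
	return 0;
}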

> If THPs were allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

THPs are a completely different beast. They aim to be transparent, so
that the user doesn't really have to control them explicitly[1]. They
should be dynamically created and demoted as the system manages
resources behind the user's back. In short, they optimize rather than
guarantee. This is also the reason why complete control sounds quite
alien to me. Say you explicitly ask for THP_SIZEFOO but the kernel
decides on a completely different size later on. What is the actual
contract you as a user are getting?
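
For contrast, a minimal sketch of the THP side (again mine, not from
the thread) - the strongest thing an application can currently say is
a hint:

/*
 * madvise(MADV_HUGEPAGE) does not name a page size and does not
 * guarantee one; the kernel may back the range with base pages,
 * collapse it into huge pages later, or split it again under pressure.
 */
#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* a hint, not a contract: success here says nothing about which
	 * page sizes will actually back this range, now or later */
	madvise(p, len, MADV_HUGEPAGE);
	return 0;
}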

In an ideal world the kernel would pick the best large page
automagically. I am a bit skeptical that we will reach such
enlightenment soon (if ever), so a certain level of hinting is likely
needed to prevent a repeat of the 2MB THP fiasco [1]. But the control
should correspond to the functionality users are getting.

> I just do not get why hugetlbfs is so special that it can have fine
> control over page sizes when normal pages cannot. The “it should be
> invisible to userspace” argument suddenly does not hold for hugetlbfs.

In short it provides a guarantee.

Does the above clarify it a bit?


[1] this is not entirely true, though, because there is a non-trivial
admin interface around THP. Mostly because they turned out to be too
transparent and many people do care about internal fragmentation,
allocation latency, locality (a small page on a local node or a large
one on a slightly more distant node?) or simply follow a cargo cult -
just have a look at how many admin guides recommend disabling THPs. We
got seriously burned by 2MB THP because of the way they were enforced
on users.
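
For reference, a minimal sketch (mine, not from the thread) of the
admin knob referred to above:

/*
 * /sys/kernel/mm/transparent_hugepage/enabled accepts "always",
 * "madvise" or "never"; "madvise" restricts THP to ranges that opted
 * in via MADV_HUGEPAGE, "never" is what many admin guides end up
 * recommending.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");

	if (!f)
		return 1;
	fputs("madvise", f);
	fclose(f);
	return 0;
}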
-- 
Michal Hocko
SUSE Labs


Thread overview: 56+ messages
2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD " Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
2020-09-28 19:34   ` Matthew Wilcox
2020-09-28 20:34     ` Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have not just PMD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 11/30] mm: thp: handling PUD THP reference bit Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 12/30] mm: rmap: add mappped/unmapped page order to anonymous page rmap functions Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap() Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 23/30] mm: support PUD THP pagemap support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37 Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 27/30] mm: thp: make PUD THP size public Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path Zi Yan
2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
2020-10-01 15:14   ` Zi Yan
2020-10-02  7:32     ` Michal Hocko
2020-10-02  7:50       ` David Hildenbrand
2020-10-02  8:10         ` Michal Hocko
2020-10-02  8:30           ` David Hildenbrand
2020-10-05 15:03             ` Zi Yan
2020-10-05 15:55               ` Matthew Wilcox
2020-10-05 17:04                 ` Roman Gushchin
2020-10-05 19:12                 ` Zi Yan
2020-10-05 19:37                   ` Matthew Wilcox
2020-10-05 17:16               ` Roman Gushchin
2020-10-05 17:27                 ` David Hildenbrand
2020-10-05 18:25                   ` Roman Gushchin
2020-10-05 18:33                     ` David Hildenbrand
2020-10-05 19:11                       ` Roman Gushchin
2020-10-06  8:25                         ` David Hildenbrand
2020-10-05 17:39               ` David Hildenbrand
2020-10-05 18:05                 ` Zi Yan
2020-10-05 18:48                   ` David Hildenbrand
2020-10-06 11:59                   ` Michal Hocko [this message]
2020-10-05 15:34         ` Zi Yan
2020-10-05 17:30           ` David Hildenbrand
