From: Roman Gushchin <guro@fb.com>
To: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>, Michal Hocko <mhocko@suse.com>,
	<linux-mm@kvack.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Rik van Riel <riel@surriel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Shakeel Butt <shakeelb@google.com>,
	Yang Shi <shy828301@gmail.com>, Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	William Kucharski <william.kucharski@oracle.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
Date: Mon, 5 Oct 2020 12:11:18 -0700
Message-ID: <20201005191118.GB3001706@carbon.dhcp.thefacebook.com>
In-Reply-To: <824eee1c-a47b-361b-ad5b-6ed64a9cbd38@redhat.com>

On Mon, Oct 05, 2020 at 08:33:44PM +0200, David Hildenbrand wrote:
> On 05.10.20 20:25, Roman Gushchin wrote:
> > On Mon, Oct 05, 2020 at 07:27:47PM +0200, David Hildenbrand wrote:
> >> On 05.10.20 19:16, Roman Gushchin wrote:
> >>> On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> >>>> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> >>>>
> >>>>> On 02.10.20 10:10, Michal Hocko wrote:
> >>>>>> On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>>>>>>>>> - huge page sizes controllable by the userspace?
> >>>>>>>>>
> >>>>>>>>> It might be good to allow advanced users to choose the page sizes, so they
> >>>>>>>>> have better control of their applications.
> >>>>>>>>
> >>>>>>>> Could you elaborate more? Those advanced users can use hugetlb, right?
> >>>>>>>> They get a very good control over page size and pool preallocation etc.
> >>>>>>>> So they can get what they need - assuming there is enough memory.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I am still not convinced that 1G THP (TGP :) ) are really what we want
> >>>>>>> to support. I can understand that there are some use cases that might
> >>>>>>> benefit from it, especially:
> >>>>>>
> >>>>>> Well, I would say that internal support for larger huge pages (e.g. 1GB)
> >>>>>> that can transparently split under memory pressure is a useful
> >>>>>> functionality. I cannot really judge how complex that would be
> >>>>>
> >>>>> Right, but that's then something different than serving (scarce,
> >>>>> unmovable) gigantic pages from CMA / reserved hugetlbfs pool. Nothing
> >>>>> wrong about *real* THP support, meaning, e.g., grouping consecutive
> >>>>> pages and converting them back and forth on demand. (E.g., 1GB ->
> >>>>> multiple 2MB -> multiple single pages), for example, when having to
> >>>>> migrate such a gigantic page. But that's very different from our
> >>>>> existing gigantic page code as far as I can tell.
> >>>>
> >>>> Serving 1GB PUD THPs from CMA is a compromise, since we do not want to
> >>>> bump MAX_ORDER to 20 to enable 1GB page allocation in buddy allocator,
> >>>> which needs section size increase. In addition, unmoveable pages cannot
> >>>> be allocated in CMA, so allocating 1GB pages has much higher chance from
> >>>> it than from ZONE_NORMAL.
> >>>
> >>> s/higher chances/non-zero chances
> >>
> >> Well, the longer the system runs (and consumes a significant amount of
> >> available main memory), the less likely it is.
> >>
> >>>
> >>> Currently we have nothing that prevents the fragmentation of the memory
> >>> with unmovable pages on the 1GB scale. It means that in a common case
> >>> it's highly unlikely to find a continuous GB without any unmovable page.
> >>> As now CMA seems to be the only working option.
> >>>
> >>
> >> And I completely dislike the use of CMA in this context (for example,
> >> allocating via CMA and freeing via the buddy by patching CMA when
> >> splitting up PUDs ...).
> >>
> >>> However it seems there are other use cases for the allocation of continuous
> >>> 1GB pages: e.g. secretfd ( https://lwn.net/Articles/831628/ ), where using
> >>> 1GB pages can reduce the fragmentation of the direct mapping.
> >>
> >> Yes, see RFC v1 where I already cced Mike.
> >>
> >>>
> >>> So I wonder if we need a new mechanism to avoid fragmentation on 1GB/PUD scale.
> >>> E.g. something like a second level of pageblocks. That would allow to group
> >>> all unmovable memory in few 1GB blocks and have more 1GB regions available for
> >>> gigantic THPs and other use cases. I'm looking now into how it can be done.
> >>
> >> Anything bigger than sections is somewhat problematic: you have to track
> >> that data somewhere. It cannot be the section (in contrast to pageblocks)
> > 
> > Well, it's not a large amount of data: the number of 1GB regions is not that
> > high even on very large machines.
> 
> Yes, but then you can have very sparse systems. And some use cases would
> actually want to avoid fragmentation on smaller levels (e.g., 128MB) -
> optimizing memory efficiency by turning off banks and such ...

It's definitely a good question.
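
To put rough numbers on the amount of tracking data, here is a
back-of-the-envelope sketch. It assumes a flat array with one entry per
1GB region over a physical address range, holes included, and 4 flag bits
per region (an assumption, loosely modelled on the handful of flag bits
kept per pageblock). nr_regions() and tracking_bytes() are hypothetical
helpers, not existing kernel functions.

```c
#include <stdint.h>

#define REGION_SHIFT    30   /* 1GB regions */
#define BITS_PER_REGION 4    /* assumption: a few flag bits per region */

/* Number of 1GB regions touched by the byte range [start, end), end > start. */
static uint64_t nr_regions(uint64_t start, uint64_t end)
{
	uint64_t region_size = 1ULL << REGION_SHIFT;

	return ((end + region_size - 1) >> REGION_SHIFT) - (start >> REGION_SHIFT);
}

/* Bytes needed for a flat bitmap covering that many regions. */
static uint64_t tracking_bytes(uint64_t regions)
{
	return (regions * BITS_PER_REGION + 7) / 8;
}
```

Under these assumptions a 4TB range has 4096 regions and needs 2KB, and
a 64TB span needs 32KB whether or not it is fully populated. This only
quantifies the size; it does not answer where the data should live or
whether 1GB is the right granularity.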

> > 
> >>
> >>> If anybody has any ideas here, I'll appreciate a lot.
> >>
> >> I already brought up the idea of ZONE_PREFER_MOVABLE (see RFC v1). That
> >> somewhat mimics what CMA does (when sized reasonably), works well with
> >> memory hot(un)plug, and is immune to misconfiguration. Within such a
> >> zone, we can try to optimize the placement of larger blocks.
> > 
> > Thank you for pointing at it!
> > 
> > The main problem with it is the same as with ZONE_MOVABLE: it does require
> > a boot-time educated guess on a good size. I admit that the CMA does too.
> 
> "Educated guess" of ratios like 1:1, 1:2, and even 1:4 (known from
> highmem times) are usually perfectly fine. And if you mess up - in
> comparison to CMA - you won't shoot yourself in the foot, you get less
> gigantic pages - which is usually better than before. I consider that a
> clear win. Perfect? No. Can we be perfect? unlikely.

I'm not necessarily opposing your idea, I just think it will be tricky
to avoid introducing additional overhead if the ratio is not perfectly
chosen. And there is simply a cost to adding a zone.

But fundamentally we're speaking about the same thing: grouping pages
by their movability on a smaller scale. With a new zone we'll split
pages into two parts with a fixed border; with a new pageblock layer
we'll group them in 1GB blocks.
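
To make the pageblock-layer variant more concrete, here is a minimal
userspace sketch. It is not kernel code and not taken from the patch
series: struct region, pick_region(), alloc_page_in() and
count_gigantic_candidates() are hypothetical names, and the placement
policy (fill a region of the same type first, then claim a free one) is
an assumption chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: each 1GB region carries one movability type. */
enum region_type { REGION_FREE, REGION_MOVABLE, REGION_UNMOVABLE };

#define PAGES_PER_REGION (1UL << 18)   /* 1GB / 4KB base pages */

struct region {
	enum region_type type;
	unsigned long used;            /* base pages currently allocated */
};

/*
 * Choose a region for one page of the wanted type: prefer a region that
 * already holds this type and has room, otherwise the first free region.
 * Returns the region index, or -1 if neither exists.
 */
static int pick_region(const struct region *r, size_t n, enum region_type want)
{
	int free_idx = -1;

	assert(want != REGION_FREE);
	for (size_t i = 0; i < n; i++) {
		if (r[i].type == want && r[i].used < PAGES_PER_REGION)
			return (int)i;
		if (r[i].type == REGION_FREE && free_idx < 0)
			free_idx = (int)i;
	}
	return free_idx;
}

/* Account one page of the wanted type; returns the region used or -1. */
static int alloc_page_in(struct region *r, size_t n, enum region_type want)
{
	int i = pick_region(r, n, want);

	if (i < 0)
		return -1;
	r[i].type = want;
	r[i].used++;
	return i;
}

/* Regions without unmovable pages could still back a 1GB page. */
static size_t count_gigantic_candidates(const struct region *r, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (r[i].type != REGION_UNMOVABLE)
			count++;
	return count;
}
```

With four empty regions, two unmovable allocations and one movable
allocation leave three regions free of unmovable pages. Spreading the
two unmovable pages over different regions would leave only two. The
sketch has no fallback for the case where no matching or free region
exists; a real design would need one.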

I think the agreement is that we need such functionality.

Thanks!

