From: Matthew Wilcox <willy@infradead.org>
To: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
	Shakeel Butt <shakeelb@google.com>,
	Yang Shi <shy828301@gmail.com>, Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	William Kucharski <william.kucharski@oracle.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64
Date: Mon, 5 Oct 2020 16:55:53 +0100
Message-ID: <20201005155553.GM20115@casper.infradead.org>
In-Reply-To: <F3606096-EF9F-4F69-89DC-287095B649DC@nvidia.com>

On Mon, Oct 05, 2020 at 11:03:56AM -0400, Zi Yan wrote:
> On 2 Oct 2020, at 4:30, David Hildenbrand wrote:
> > Yes, I think one important feature would be that we don't end up placing
> > a gigantic page where only a handful of pages are actually populated
> > without green light from the application - because that's what some user
> > space applications care about (not consuming more memory than intended.
> > IIUC, this is also what this patch set does). I'm fine with placing
> > gigantic pages if it really just "defragments" the address space layout,
> > without filling unpopulated holes.
> >
> > Then, this would be mostly invisible to user space, and we really
> > wouldn't have to care about any configuration.
> 
> I agree that the interface should be as simple as no configuration to
> most users. But I also wonder why we have hugetlbfs to allow users to
> specify different kinds of page sizes, which seems against the discussion
> above. Are we assuming advanced users should always use hugetlbfs instead
> of THPs?

Evolution doesn't always produce the best outcomes ;-)

A perennial mistake we've made is "Oh, this is a strange & new & weird
feature that most applications will never care about, let's put it in
hugetlbfs where nobody will notice and we don't have to think about it
in the core VM".

And then what was initially strange & new & weird gradually becomes
something that most applications just want to have happen automatically,
and telling them all to go use hugetlbfs becomes untenable, so we move
the feature into the core VM.

It is absurd that my phone is attempting to manage a million 4kB pages.
I think even trying to manage a quarter-million 16kB pages is too much
work, and really it would be happier managing 65,000 64kB pages.
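
Just to show where those numbers come from: they're simply 4GB of RAM
divided by the page size.  A throwaway userspace calculation (nothing
kernel-specific, purely illustrative):

#include <stdio.h>

int main(void)
{
	/* Number of pages needed to cover 4GiB at various base page sizes. */
	unsigned long long ram = 4ULL << 30;
	unsigned int size_kb[] = { 4, 16, 64 };

	for (unsigned int i = 0; i < sizeof(size_kb) / sizeof(size_kb[0]); i++)
		printf("%2ukB pages: %llu\n", size_kb[i],
		       ram / (size_kb[i] * 1024ULL));
	return 0;
}

which prints 1048576, 262144 and 65536 respectively.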

Extend that into the future a decade or two, and we'll be expecting
that it manages memory in megabyte-sized units and uses PMD and PUD
mappings by default.  PTE mappings will still be used, but very much
on an "Oh, you have a tiny file, OK, we'll fragment a megabyte page into
smaller pages to not waste too much memory when mapping it" basis.  So,
yeah, PUD sized mappings have problems today, but we should be writing
software now so a Pixel 15 in a decade can boot a kernel built five
years from now and have PUD mappings Just Work without requiring the
future userspace programmer to "use hugetlbfs".

One of the longer-term todo items is to support variable sized THPs for
anonymous memory, just like I've done for the pagecache.  With that in
place, I think scaling up from PMD sized pages to PUD sized pages starts
to look more natural.  Itanium and PA-RISC (two architectures that will
never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
The RISC-V spec you pointed me at the other day confines itself to adding
support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
sizes would be possible additions in the future.
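
To put some numbers on "variable sized": with a 4kB base page every order
doubles the size, the PMD size is order 9 and the PUD size is order 18,
so there's a long ladder of intermediate sizes to grow through.  A trivial
illustration (userspace, just printing the ladder):

#include <stdio.h>

int main(void)
{
	/*
	 * With a 4kB base page, order 9 is the 2MB PMD size and order 18
	 * is the 1GB PUD size; everything in between is a candidate size
	 * for a variable-order THP.
	 */
	unsigned long base = 4096;

	for (unsigned int order = 0; order <= 18; order += 3)
		printf("order %2u: %7lu kB\n", order, (base << order) / 1024);
	return 0;
}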


But, back to today, what to do with this patchset?  Even on my 16GB
laptop, let alone my 4GB phone, I'm uncertain that allocating a 1GB
page is ever the right decision to make.  But my laptop runs a "mixed"
workload, and if you could convince me that Firefox would run 10% faster
by using a 1GB page as its in-memory cache, well, I'd be sold.

I do like having the kernel figure out what's in the best interests of the
system as a whole.  Apps don't have enough information, and while they
can provide hints, they're often wrong.  So, let's say an app maps 8GB
of anonymous memory.  As the app accesses it, we should probably start
by allocating 4kB pages to back that memory.  As time goes on and that
memory continues to be accessed and more memory is accessed, it makes
sense to keep track of that, replacing the existing 4kB pages with, say,
16-64kB pages and allocating newly accessed memory with larger pages.
Eventually that should grow to 2MB allocations and PMD mappings.
And then continue on, all the way to 1GB pages.
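
I'm not claiming the policy below is the right one; it's just a sketch of
the shape of the thing, with a made-up promotion ladder and a made-up
threshold, written as plain userspace C rather than anything that would
actually hook into the fault path or khugepaged:

#include <stdio.h>

/* Hypothetical promotion ladder: 4kB -> 16kB -> 64kB -> 2MB -> 1GB. */
static const unsigned int ladder_kb[] = { 4, 16, 64, 2048, 1048576 };
#define LADDER_STEPS	(sizeof(ladder_kb) / sizeof(ladder_kb[0]))

/* Invented threshold: promote after this many accesses at the current size. */
#define PROMOTE_THRESHOLD	1000UL

struct region {
	unsigned int step;		/* index into ladder_kb */
	unsigned long accesses;		/* accesses observed at this size */
};

static void region_accessed(struct region *r, unsigned long count)
{
	r->accesses += count;
	if (r->step + 1 < LADDER_STEPS && r->accesses >= PROMOTE_THRESHOLD) {
		r->step++;		/* collapse to the next larger size */
		r->accesses = 0;	/* measure afresh at the new size */
	}
}

int main(void)
{
	struct region r = { 0, 0 };

	for (int round = 1; round <= 5; round++) {
		region_accessed(&r, 1000);
		printf("round %d: region mapped with %ukB pages\n",
		       round, ladder_kb[r.step]);
	}
	return 0;
}

The part that sketch doesn't capture is the downgrade path (splitting
when the access density drops), which is exactly the bit I worry about
below.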

We also need to be able to figure out that it's not being effective
any more.  One of the issues with tracking accessed/dirty at the 1GB
level is that writing out an entire 1GB page is going to take about 0.25
seconds on an x4 gen3 PCIe link (back-of-the-envelope below).  I know
swapping sucks, but that's extreme.  So to use 1GB pages effectively
today, we need to fragment them before choosing to swap them out (*).
Maybe that's the point where we can start to say "OK,
this sized mapping might not be effective any more".  On the other hand,
that might not work for some situations.  Imagine, e.g., a matrix multiply
(everybody's favourite worst-case scenario).  C = A * B where each of A,
B and C is too large to fit in DRAM.  There are going to be points of the
calculation where each element of A is going to be walked sequentially,
and so it'd be nice to use larger PTEs to map it, but then we need to
destroy that almost immediately to allow other things to use the memory.
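
The 0.25 second figure above is just 1GB divided by the roughly 4GB/s an
x4 gen3 link can move (8GT/s per lane, 128b/130b encoding, ignoring
protocol overhead):

#include <stdio.h>

int main(void)
{
	/* x4 gen3: 8GT/s per lane, 128b/130b encoding, no protocol overhead. */
	double bytes_per_sec = 4 * 8e9 * (128.0 / 130.0) / 8;
	double page_bytes = 1024.0 * 1024 * 1024;	/* one 1GB page */

	printf("~%.1f GB/s, so ~%.2f s to write out one 1GB page\n",
	       bytes_per_sec / 1e9, page_bytes / bytes_per_sec);
	return 0;
}

which comes out a shade over a quarter of a second, and worse once you
account for protocol overhead.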


I think I'm leaning towards not merging this patchset yet.  I'm in
agreement with the goals (allowing systems to use PUD-sized pages
automatically), but I think we need to improve the infrastructure to
make it work well automatically.  Does that make sense?

(*) It would be nice if hardware provided a way to track D/A on a sub-PTE
level when using PMD/PUD sized mappings.  I don't know of any that does
that today.

Thread overview: 56+ messages
2020-09-28 17:53 [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Zi Yan
2020-09-28 17:53 ` [RFC PATCH v2 01/30] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 02/30] mm: pagewalk: use READ_ONCE when reading the PMD " Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 03/30] mm: thp: use single linked list for THP page table page deposit Zi Yan
2020-09-28 19:34   ` Matthew Wilcox
2020-09-28 20:34     ` Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 04/30] mm: add new helper functions to allocate one PMD page with 512 PTE pages Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 05/30] mm: thp: add page table deposit/withdraw functions for PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 06/30] mm: change thp_order and thp_nr as we will have not just PMD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 07/30] mm: thp: add anonymous PUD THP page fault support without enabling it Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 08/30] mm: thp: add PUD THP support for copy_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 09/30] mm: thp: add PUD THP support to zap_huge_pud Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 10/30] fs: proc: add PUD THP kpageflag Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 11/30] mm: thp: handling PUD THP reference bit Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 12/30] mm: rmap: add mappped/unmapped page order to anonymous page rmap functions Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 13/30] mm: rmap: add map_order to page_remove_anon_compound_rmap Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 14/30] mm: thp: add PUD THP split_huge_pud_page() function Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 15/30] mm: thp: add PUD THP to deferred split list when PUD mapping is gone Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 16/30] mm: debug: adapt dump_page to PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 17/30] mm: thp: PUD THP COW splits PUD page and falls back to PMD page Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 18/30] mm: thp: PUD THP follow_p*d_page() support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 19/30] mm: stats: make smap stats understand PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 20/30] mm: page_vma_walk: teach it about PMD-mapped PUD THP Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 21/30] mm: thp: PUD THP support in try_to_unmap() Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 22/30] mm: thp: split PUD THPs at page reclaim Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 23/30] mm: support PUD THP pagemap support Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 24/30] mm: madvise: add page size options to MADV_HUGEPAGE and MADV_NOHUGEPAGE Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 25/30] mm: vma: add VM_HUGEPAGE_PUD to vm_flags at bit 37 Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 26/30] mm: thp: add a global knob to enable/disable PUD THPs Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 27/30] mm: thp: make PUD THP size public Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 28/30] hugetlb: cma: move cma reserve function to cma.c Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 29/30] mm: thp: use cma reservation for pud thp allocation Zi Yan
2020-09-28 17:54 ` [RFC PATCH v2 30/30] mm: thp: enable anonymous PUD THP at page fault path Zi Yan
2020-09-30 11:55 ` [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64 Michal Hocko
2020-10-01 15:14   ` Zi Yan
2020-10-02  7:32     ` Michal Hocko
2020-10-02  7:50       ` David Hildenbrand
2020-10-02  8:10         ` Michal Hocko
2020-10-02  8:30           ` David Hildenbrand
2020-10-05 15:03             ` Zi Yan
2020-10-05 15:55               ` Matthew Wilcox [this message]
2020-10-05 17:04                 ` Roman Gushchin
2020-10-05 19:12                 ` Zi Yan
2020-10-05 19:37                   ` Matthew Wilcox
2020-10-05 17:16               ` Roman Gushchin
2020-10-05 17:27                 ` David Hildenbrand
2020-10-05 18:25                   ` Roman Gushchin
2020-10-05 18:33                     ` David Hildenbrand
2020-10-05 19:11                       ` Roman Gushchin
2020-10-06  8:25                         ` David Hildenbrand
2020-10-05 17:39               ` David Hildenbrand
2020-10-05 18:05                 ` Zi Yan
2020-10-05 18:48                   ` David Hildenbrand
2020-10-06 11:59                   ` Michal Hocko
2020-10-05 15:34         ` Zi Yan
2020-10-05 17:30           ` David Hildenbrand
