All of lore.kernel.org
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15
Date: Mon, 30 Aug 2021 16:27:04 -0400	[thread overview]
Message-ID: <YS0/GHBG15+2Mglk@cmpxchg.org> (raw)
In-Reply-To: <YS0h4cFhwYoW3MBI@casper.infradead.org>

On Mon, Aug 30, 2021 at 07:22:25PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> > > The mistake you're making is coupling "minimum mapping granularity" with
> > > "minimum allocation granularity".  We can happily build a system which
> > > only allocates memory on 2MB boundaries and yet lets you map that memory
> > > to userspace in 4kB granules.
> > 
> > Yeah, but I want to do it without allocating 4k granule descriptors
> > statically at boot time for the entirety of available memory.
> 
> Even that is possible when bumping the PAGE_SIZE to 16kB.  It needs a
> bit of fiddling:
> 
> static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
>                         unsigned long addr, struct page *page, pgprot_t prot)
> {
>         if (!pte_none(*pte))
>                 return -EBUSY;
>         /* Ok, finally just insert the thing.. */
>         get_page(page);
>         inc_mm_counter_fast(mm, mm_counter_file(page));
>         page_add_file_rmap(page, false);
>         set_pte_at(mm, addr, pte, mk_pte(page, prot));
>         return 0;
> }
> 
> mk_pte() assumes that a struct page refers to a single pte.  If we
> revamped it to take (page, offset, prot), it could construct the
> appropriate pte for the offset within that page.

Right, page tables only need a pfn. The struct page is for us to
maintain additional state about the object.

For the objects that are subpage sized, we should be able to hold that
state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
ad-hoc allocated descriptors.

Descriptors which could well be what struct folio {} is today, IMO. As
long as it doesn't innately assume, or will assume, in the API the
1:1+ mapping to struct page that is inherent to the compound page.

> Independent of _that_, the biggest problem we face (I think) in getting
> rid of memmap is that it offers the pfn_to_page() lookup.  If we move to a
> dynamically allocated descriptor for our arbitrarily-sized memory objects,
> we need a tree to store them in.  Given the trees we currently have,
> our best bet is probably the radix tree, but I dislike its glass jaws.
> I'm hoping that (again) the maple tree becomes stable soon enough for
> us to dynamically allocate memory descriptors and store them in it.
> And that we don't discover a bootstrapping problem between kmalloc()
> (for tree nodes) and memmap (to look up the page associated with a node).
> 
> But that's all a future problem and if we can't even take a first step
> to decouple filesystems from struct page then working towards that would
> be wasted effort.

Agreed. Again, I'm just advocating to keep the doors open on that, and
avoid the situation where the filesystem folks run off and convert to
a flexible folio data structure, and the MM people run off and convert
all compound pages to folio and in the process hardcode assumptions
and turn it basically into struct page again that can't easily change.

> > > > Willy says he has future ideas to make compound pages scale. But we
> > > > have years of history saying this is incredibly hard to achieve - and
> > > > it certainly wasn't for a lack of constant trying.
> > > 
> > > I genuinely don't understand.  We have five primary users of memory
> > > in Linux (once we're in a steady state after boot):
> > > 
> > >  - Anonymous memory
> > >  - File-backed memory
> > >  - Slab
> > >  - Network buffers
> > >  - Page tables
> > > 
> > > The relative importance of each one very much depends on your workload.
> > > Slab already uses medium order pages and can be made to use larger.
> > > Folios should give us large allocations of file-backed memory and
> > > eventually anonymous memory.  Network buffers seem to be headed towards
> > > larger allocations too.  Page tables will need some more thought, but
> > > once we're no longer interleaving file cache pages, anon pages and
> > > page tables, they become less of a problem to deal with.
> > > 
> > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > to allocate.  When everybody's allocating order-0 pages, order-4 pages
> > > require the right 16 pages to come available, and that's really freaking
> > > hard.
> > 
> > Well yes, once (and iff) everybody is doing that. But for the
> > foreseeable future we're expecting to stay in a world where the
> > *majority* of memory is in larger chunks, while we continue to see 4k
> > cache entries, anon pages, and corresponding ptes, yes?
> 
> No.  4k page table entries are demanded by the architecture, and there's
> little we can do about that.

I wasn't claiming otherwise..?

> > Memory is dominated by larger allocations from the main workloads, but
> > we'll continue to have a base system that does logging, package
> > upgrades, IPC stuff, has small config files, small libraries, small
> > executables. It'll be a while until we can raise the floor on those
> > much smaller allocations - if ever.
> > 
> > So we need a system to manage them living side by side.
> > 
> > The slab allocator has proven to be an excellent solution to this
> > problem, because the mailing lists are not flooded with OOM reports
> > where smaller allocations fragmented the 4k page space. And even large
> > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > back with fairly reasonable CPU overhead.
> 
> You may not see the bug reports, but they exist.  Right now, we have
> a service that is echoing 2 to drop_caches every hour on systems which
> are lightly loaded, otherwise the dcache swamps the entire machine and
> takes hours or days to come back under control.

Sure, but compare that to the number of complaints about higher-order
allocations failing or taking too long (THP in the fault path e.g.)...

Typegrouping isn't infallible for fighting fragmentation, but it seems
to be good enough for most cases. Unlike the buddy allocator.

  reply	other threads:[~2021-08-30 20:25 UTC|newest]

Thread overview: 175+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-23 19:01 [GIT PULL] Memory folios for v5.15 Matthew Wilcox
2021-08-23 21:26 ` Johannes Weiner
2021-08-23 22:06   ` Linus Torvalds
2021-08-23 22:06     ` Linus Torvalds
2021-08-24  2:20     ` Matthew Wilcox
2021-08-24 13:04     ` Matthew Wilcox
2021-08-23 22:15   ` Matthew Wilcox
2021-08-24 18:32     ` Johannes Weiner
2021-08-24 18:59       ` Linus Torvalds
2021-08-24 18:59         ` Linus Torvalds
2021-08-25  6:39         ` Christoph Hellwig
2021-08-24 19:44       ` Matthew Wilcox
2021-08-25 15:13         ` Johannes Weiner
2021-08-26  0:45           ` Darrick J. Wong
2021-08-27 14:07             ` Johannes Weiner
2021-08-27 18:44               ` Matthew Wilcox
2021-08-27 21:41                 ` Dan Williams
2021-08-27 21:41                   ` Dan Williams
2021-08-27 21:49                   ` Matthew Wilcox
2021-08-30 17:32                 ` Johannes Weiner
2021-08-30 18:22                   ` Matthew Wilcox
2021-08-30 20:27                     ` Johannes Weiner [this message]
2021-08-30 21:38                       ` Matthew Wilcox
2021-08-31 17:40                         ` Vlastimil Babka
2021-09-01 17:43                         ` Johannes Weiner
2021-09-02 15:13                           ` Zi Yan
2021-09-06 14:00                             ` Vlastimil Babka
2021-08-31 18:50                       ` Eric W. Biederman
2021-08-31 18:50                         ` Eric W. Biederman
2021-08-26  8:58         ` David Howells
2021-08-27 10:03           ` Johannes Weiner
2021-08-27 12:05             ` Matthew Wilcox
2021-08-27 10:49           ` David Howells
2021-08-24 15:54   ` David Howells
2021-08-24 17:56     ` Matthew Wilcox
2021-08-24 18:26       ` Linus Torvalds
2021-08-24 18:26         ` Linus Torvalds
2021-08-24 18:29         ` Linus Torvalds
2021-08-24 18:29           ` Linus Torvalds
2021-08-24 19:26           ` Theodore Ts'o
2021-08-24 19:34           ` David Howells
2021-08-24 20:02             ` Theodore Ts'o
2021-08-24 21:32             ` David Howells
2021-08-25 12:08               ` Jeff Layton
2021-08-24 19:01         ` Matthew Wilcox
2021-08-24 19:11           ` Linus Torvalds
2021-08-24 19:11             ` Linus Torvalds
2021-08-24 19:23             ` Matthew Wilcox
2021-08-24 19:44               ` Theodore Ts'o
2021-08-24 20:00                 ` Matthew Wilcox
2021-08-25  6:32                 ` Christoph Hellwig
2021-08-25  9:01                   ` Rasmus Villemoes
2021-08-26  6:32                     ` Amir Goldstein
2021-08-26  6:32                       ` Amir Goldstein
2021-08-25 12:03                   ` Jeff Layton
2021-08-25 12:03                     ` Jeff Layton
2021-08-26  0:59                     ` Darrick J. Wong
2021-08-26  4:02                   ` Nicholas Piggin
2021-09-01 12:58                 ` Mike Rapoport
2021-08-24 19:35             ` David Howells
2021-08-24 20:35               ` Vlastimil Babka
2021-08-24 20:40                 ` Vlastimil Babka
2021-08-24 19:11         ` David Howells
2021-08-24 19:25           ` Linus Torvalds
2021-08-24 19:25             ` Linus Torvalds
2021-08-24 19:38             ` Linus Torvalds
2021-08-24 19:38               ` Linus Torvalds
2021-08-24 19:48               ` Linus Torvalds
2021-08-24 19:48                 ` Linus Torvalds
2021-08-26 17:18                 ` Matthew Wilcox
2021-08-24 19:59             ` David Howells
2021-10-05 13:52   ` Matthew Wilcox
2021-10-05 17:29     ` Johannes Weiner
2021-10-05 17:32       ` David Hildenbrand
2021-10-05 18:30       ` Matthew Wilcox
2021-10-05 19:56         ` Jason Gunthorpe
2021-08-28  3:29 ` Matthew Wilcox
2021-09-09 12:43 ` Christoph Hellwig
2021-09-09 13:56   ` Vlastimil Babka
2021-09-09 18:16     ` Johannes Weiner
2021-09-09 18:44       ` Matthew Wilcox
2021-09-09 22:03         ` Johannes Weiner
2021-09-09 22:48           ` Matthew Wilcox
2021-09-09 19:17     ` John Hubbard
2021-09-09 19:23       ` Matthew Wilcox
2021-09-10 20:16 ` Folio discussion recap Kent Overstreet
2021-09-11  1:23   ` Kirill A. Shutemov
2021-09-13 11:32     ` Michal Hocko
2021-09-13 18:12       ` Johannes Weiner
2021-09-15 15:40   ` Johannes Weiner
2021-09-15 17:55     ` Damian Tometzki
2021-09-16  2:58     ` Darrick J. Wong
2021-09-16 16:54       ` Johannes Weiner
2021-09-17  5:24         ` Dave Chinner
2021-09-17  7:18           ` Christoph Hellwig
2021-09-17 16:31           ` Johannes Weiner
2021-09-17 20:57             ` Kirill A. Shutemov
2021-09-17 21:17               ` Kent Overstreet
2021-09-17 22:02                 ` Kirill A. Shutemov
2021-09-17 22:21                   ` Kent Overstreet
2021-09-17 23:15               ` Johannes Weiner
2021-09-20 10:03                 ` Kirill A. Shutemov
2021-09-17 21:13             ` Kent Overstreet
2021-09-17 22:25               ` Theodore Ts'o
2021-09-17 23:35                 ` Josef Bacik
2021-09-18  1:04             ` Dave Chinner
2021-09-18  4:51               ` Kent Overstreet
2021-09-20  1:04                 ` Dave Chinner
2021-09-16 21:58       ` David Howells
2021-09-20  2:17   ` Matthew Wilcox
2021-09-21 19:47     ` Johannes Weiner
2021-09-21 20:38       ` Matthew Wilcox
2021-09-21 21:11         ` Kent Overstreet
2021-09-21 21:22           ` Folios for 5.15 request - Was: re: Folio discussion recap - Kent Overstreet
2021-09-22 15:08             ` Johannes Weiner
2021-09-22 15:46               ` Kent Overstreet
2021-09-22 16:26                 ` Matthew Wilcox
2021-09-22 16:56                   ` Chris Mason
2021-09-22 19:54                     ` Matthew Wilcox
2021-09-22 20:15                       ` Kent Overstreet
2021-09-22 20:21                       ` Linus Torvalds
2021-09-22 20:21                         ` Linus Torvalds
2021-09-23  5:42               ` Kent Overstreet
2021-09-23 18:00                 ` Johannes Weiner
2021-09-23 19:31                   ` Matthew Wilcox
2021-09-23 20:20                   ` Kent Overstreet
2021-10-16  3:28               ` Matthew Wilcox
2021-10-18 16:47                 ` Johannes Weiner
2021-10-18 18:12                   ` Kent Overstreet
2021-10-18 20:45                     ` Johannes Weiner
2021-10-19 16:11                       ` Splitting struct page into multiple types " Kent Overstreet
2021-10-19 17:06                         ` Gao Xiang
2021-10-19 17:34                           ` Matthew Wilcox
2021-10-19 17:54                             ` Gao Xiang
2021-10-20 17:46                               ` Kent Overstreet
2021-10-19 17:37                         ` Jason Gunthorpe
2021-10-19 21:14                       ` David Howells
2021-10-18 18:28                   ` Folios for 5.15 request " Matthew Wilcox
2021-10-18 21:56                     ` Johannes Weiner
2021-10-18 23:16                       ` Kirill A. Shutemov
2021-10-19 15:16                         ` Johannes Weiner
2021-10-20  3:19                           ` Matthew Wilcox
2021-10-20  7:50                           ` David Hildenbrand
2021-10-20 17:26                             ` Matthew Wilcox
2021-10-20 18:04                               ` David Hildenbrand
2021-10-21  6:51                                 ` Christoph Hellwig
2021-10-21  7:21                                   ` David Hildenbrand
2021-10-21 12:03                                     ` Kent Overstreet
2021-10-21 12:35                                       ` David Hildenbrand
2021-10-21 12:38                                         ` Christoph Hellwig
2021-10-21 13:00                                           ` David Hildenbrand
2021-10-21 12:41                                         ` Matthew Wilcox
2021-10-20 17:39                           ` Kent Overstreet
2021-10-21 21:37                             ` Johannes Weiner
2021-10-22  1:52                               ` Matthew Wilcox
2021-10-22  7:59                                 ` David Hildenbrand
2021-10-22 13:01                                   ` Matthew Wilcox
2021-10-22 14:40                                     ` David Hildenbrand
2021-10-23  2:22                                       ` Matthew Wilcox
2021-10-23  5:02                                         ` Christoph Hellwig
2021-10-23  9:58                                         ` David Hildenbrand
2021-10-23 16:00                                           ` Kent Overstreet
2021-10-23 21:41                                             ` Matthew Wilcox
2021-10-23 22:23                                               ` Kent Overstreet
2021-10-25 15:35                                 ` Johannes Weiner
2021-10-25 15:52                                   ` Matthew Wilcox
2021-10-25 16:05                                   ` Kent Overstreet
2021-10-16 19:07               ` Matthew Wilcox
2021-10-18 17:25                 ` Johannes Weiner
2021-09-21 22:18           ` Folio discussion recap Matthew Wilcox
2021-09-23  0:45             ` Ira Weiny
2021-09-23  3:41               ` Matthew Wilcox
2021-09-23 22:12                 ` Ira Weiny
2021-09-29 15:24                   ` Matthew Wilcox
2021-09-21 21:59         ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YS0/GHBG15+2Mglk@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=djwong@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.