linux-fsdevel.vger.kernel.org archive mirror
From: Matthew Wilcox <willy@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15
Date: Mon, 30 Aug 2021 22:38:20 +0100
Message-ID: <YS1PzKLr2AWenbHF@casper.infradead.org>
In-Reply-To: <YS0/GHBG15+2Mglk@cmpxchg.org>

On Mon, Aug 30, 2021 at 04:27:04PM -0400, Johannes Weiner wrote:
> Right, page tables only need a pfn. The struct page is for us to
> maintain additional state about the object.
> 
> For the objects that are subpage sized, we should be able to hold that
> state (shrinker lru linkage, referenced bit, dirtiness, ...) inside
> ad-hoc allocated descriptors.
> 
> Descriptors which could well be what struct folio {} is today, IMO. As
> long as it doesn't innately assume, or will assume, in the API the
> 1:1+ mapping to struct page that is inherent to the compound page.

Maybe this is where we fundamentally disagree.  I don't think there's
any point in *managing* memory in a different size from that in which it
is *allocated*.  There's no point in tracking dirtiness, LRU position,
locked, etc, etc in different units from allocation size.  The point of
tracking all these things is so we can allocate and free memory.  If
a 'cache descriptor' reaches the end of the LRU and should be reclaimed,
that's wasted effort in tracking if the rest of the 'cache descriptor'
is dirty and heavily in use.  So a 'cache descriptor' should always be
at least a 'struct page' in size (assuming you're using 'struct page'
to mean "the size of the smallest allocation unit from the page
allocator").
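A toy model of that argument, in userspace C: one allocation unit is split into hypothetical sub-unit descriptors, and the unit can only go back to the page allocator once every descriptor in it is clean and unreferenced. struct alloc_unit, SUBUNITS and unit_is_freeable() are made up for illustration; this sketches the reasoning, not any proposed kernel structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: number of sub-unit descriptors per allocation unit. */
#define SUBUNITS 4

/* Hypothetical per-descriptor state (dirtiness, recent use). */
struct subunit_state {
	bool dirty;
	bool referenced;
};

/* One allocation unit as handed out by the page allocator. */
struct alloc_unit {
	struct subunit_state sub[SUBUNITS];
};

/*
 * The page allocator can only take back the whole unit, so reclaiming one
 * cold, clean descriptor frees nothing while any other descriptor in the
 * same unit is still dirty or in use.
 */
static bool unit_is_freeable(const struct alloc_unit *u)
{
	for (size_t i = 0; i < SUBUNITS; i++)
		if (u->sub[i].dirty || u->sub[i].referenced)
			return false;
	return true;
}
```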

> > > > I genuinely don't understand.  We have five primary users of memory
> > > > in Linux (once we're in a steady state after boot):
> > > > 
> > > >  - Anonymous memory
> > > >  - File-backed memory
> > > >  - Slab
> > > >  - Network buffers
> > > >  - Page tables
> > > > 
> > > > The relative importance of each one very much depends on your workload.
> > > > Slab already uses medium order pages and can be made to use larger.
> > > > Folios should give us large allocations of file-backed memory and
> > > > eventually anonymous memory.  Network buffers seem to be headed towards
> > > > larger allocations too.  Page tables will need some more thought, but
> > > > once we're no longer interleaving file cache pages, anon pages and
> > > > page tables, they become less of a problem to deal with.
> > > > 
> > > > Once everybody's allocating order-4 pages, order-4 pages become easy
> > > > to allocate.  When everybody's allocating order-0 pages, order-4 pages
> > > > require the right 16 pages to come available, and that's really freaking
> > > > hard.
> > > 
> > > Well yes, once (and iff) everybody is doing that. But for the
> > > foreseeable future we're expecting to stay in a world where the
> > > *majority* of memory is in larger chunks, while we continue to see 4k
> > > cache entries, anon pages, and corresponding ptes, yes?
> > 
> > No.  4k page table entries are demanded by the architecture, and there's
> > little we can do about that.
> 
> I wasn't claiming otherwise..?

You snipped the part of my paragraph that made the 'No' make sense.
I'm agreeing that page tables will continue to be a problem, but
everything else (page cache, anon, networking, slab) I expect to be
using higher order allocations within the next year.
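To make the order-4 point from earlier in the thread concrete: a buddy block of order n is 2^n contiguous pages starting at a pfn that is a multiple of 2^n, so scattered order-0 allocations can leave most of memory free with no order-4 block available. The userspace sketch below searches a boolean free map for such a block; find_free_block() and the free map are hypothetical, and the real buddy allocator keeps per-order free lists rather than scanning a bitmap.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: free_map[i] is true if page frame i is free.  An order-n
 * block is 2^n contiguous free pages starting at a pfn that is a
 * multiple of 2^n.  Returns the first such block's pfn, or -1.
 */
static long find_free_block(const bool *free_map, size_t nr_pages,
			    unsigned int order)
{
	size_t block = (size_t)1 << order;

	/* Only naturally aligned starting points are candidates. */
	for (size_t start = 0; start + block <= nr_pages; start += block) {
		size_t i;

		for (i = 0; i < block; i++)
			if (!free_map[start + i])
				break;
		if (i == block)
			return (long)start;
	}
	return -1;
}
```

With one busy page in each 16-page block of a 64-page map, 60 pages are free but find_free_block(map, 64, 4) returns -1; freeing the single busy page in the last block makes pfn 48 available as an order-4 block.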

> > > The slab allocator has proven to be an excellent solution to this
> > > problem, because the mailing lists are not flooded with OOM reports
> > > where smaller allocations fragmented the 4k page space. And even large
> > > temporary slab explosions (inodes, dentries etc.) are usually pushed
> > > back with fairly reasonable CPU overhead.
> > 
> > You may not see the bug reports, but they exist.  Right now, we have
> > a service that is echoing 2 to drop_caches every hour on systems which
> > are lightly loaded, otherwise the dcache swamps the entire machine and
> > takes hours or days to come back under control.
> 
> Sure, but compare that to the number of complaints about higher-order
> allocations failing or taking too long (THP in the fault path e.g.)...

Oh, we have those bug reports too ...

> Typegrouping isn't infallible for fighting fragmentation, but it seems
> to be good enough for most cases. Unlike the buddy allocator.

You keep saying that the buddy allocator isn't given enough information to
do any better, but I think it is.  Page cache and anon memory are marked
with __GFP_MOVABLE.  Slab, network and page tables aren't.  Is there a
reason that isn't enough?
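For reference, the page allocator turns the mobility GFP flags into a migrate type with a mask and a shift (gfp_migratetype() in include/linux/gfp.h). The model below mimics that in userspace; the MODEL_ names and the bit values are assumptions meant to resemble kernels of this period, so check gfp.h before relying on them.

```c
/*
 * Userspace model of gfp_migratetype().  Bit values are assumptions that
 * resemble kernels of this era; the real ones live in include/linux/gfp.h.
 */
#define MODEL_GFP_MOVABLE	0x08u
#define MODEL_GFP_RECLAIMABLE	0x10u
#define MODEL_GFP_MOVABLE_MASK	(MODEL_GFP_MOVABLE | MODEL_GFP_RECLAIMABLE)
#define MODEL_GFP_MOVABLE_SHIFT	3

enum model_migratetype {
	MODEL_MIGRATE_UNMOVABLE,	/* 0: no mobility hint (e.g. page tables) */
	MODEL_MIGRATE_MOVABLE,		/* 1: page cache, anonymous memory */
	MODEL_MIGRATE_RECLAIMABLE,	/* 2: e.g. SLAB_RECLAIM_ACCOUNT caches */
};

/* Setting both bits is invalid (a debug kernel warns); not handled here. */
static enum model_migratetype model_gfp_migratetype(unsigned int gfp)
{
	return (enum model_migratetype)((gfp & MODEL_GFP_MOVABLE_MASK) >>
					MODEL_GFP_MOVABLE_SHIFT);
}
```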

I think something that might actually help is if we added a pair of new
GFP flags, __GFP_FAST and __GFP_DENSE.  Dense allocations are those which
are expected to live for a long time, and so the page allocator should
try to group them with other dense allocations.  Slab and page tables
should use DENSE, along with things like superblocks, or fs bitmaps where
the speed of allocation is almost unimportant, but attempting to keep
them out of the way of other allocations is useful.  Fast allocations
are for allocations which should not live for very long.  The speed of
allocation dominates, and it's OK if the allocation gets in the way of
defragmentation for a while.
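__GFP_FAST and __GFP_DENSE do not exist, so the following is only a toy model of the grouping half of the proposal: long-lived (dense) requests are packed from the bottom of a small pool of blocks and everything else is taken from the top, so that freeing short-lived allocations tends to leave one contiguous free run. The HYP_ names, the pool, and the choice to treat unhinted requests like fast ones are all assumptions; the speed trade-off for fast allocations is not modelled.

```c
#include <stdbool.h>

/* Hypothetical lifetime hints from the proposal; not real kernel flags. */
#define HYP_GFP_FAST	0x1u
#define HYP_GFP_DENSE	0x2u

#define NR_BLOCKS	16

/* Toy pool: used[i] marks block i as allocated. */
struct hyp_pool {
	bool used[NR_BLOCKS];
};

/*
 * Dense (long-lived) requests take the lowest free block; all others
 * take the highest, keeping the two lifetime classes apart.
 */
static int hyp_alloc_block(struct hyp_pool *pool, unsigned int gfp)
{
	if (gfp & HYP_GFP_DENSE) {
		for (int i = 0; i < NR_BLOCKS; i++)
			if (!pool->used[i]) {
				pool->used[i] = true;
				return i;
			}
	} else {
		for (int i = NR_BLOCKS - 1; i >= 0; i--)
			if (!pool->used[i]) {
				pool->used[i] = true;
				return i;
			}
	}
	return -1;	/* pool exhausted */
}

static void hyp_free_block(struct hyp_pool *pool, int block)
{
	pool->used[block] = false;
}
```

Two dense allocations land in blocks 0 and 1 and two fast ones in 15 and 14; once the fast ones are freed, blocks 2 to 15 form a single free run again.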

An example of another allocator that could care about DENSE vs FAST
would be vmalloc.  Today, it does:

        if (array_size > PAGE_SIZE) {
                area->pages = __vmalloc_node(array_size, 1, nested_gfp, node,
                                        area->caller);
        } else {
                area->pages = kmalloc_node(array_size, nested_gfp, node);
        }

That's actually pretty bad; if you have, say, a 6MB vmalloc space,
you need a 12kB array of page pointers.  We currently allocate 16kB for
the array, when we could use alloc_pages_exact() to free the 4kB we're
never going to use.  If this is __GFP_DENSE, we know it's a long-lived
allocation and we can let somebody else use the extra 4kB.  If it's not,
it's probably not worth bothering with.
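The rounding arithmetic can be checked with a small userspace model. It assumes 4kB pages and 8-byte page pointers, under which a 12kB pointer array describes a 6MB (1536-page) area. A single buddy allocation of 12kB has to be order 2 (16kB), while alloc_pages_exact() allocates that order-2 block and then frees the tail pages past PAGE_ALIGN(size). In the excerpt above, arrays larger than PAGE_SIZE actually take the __vmalloc_node() branch, so this only illustrates the power-of-two rounding; the helper names are made up.

```c
#include <stddef.h>

/* Userspace model; assumes 4kB pages and 8-byte pointers (64-bit). */
#define MODEL_PAGE_SIZE 4096UL

/* Bytes needed for an array holding one page pointer per page of the area. */
static size_t page_array_bytes(size_t area_bytes)
{
	size_t nr_pages = (area_bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

	return nr_pages * sizeof(void *);
}

/* Smallest power-of-two number of pages (a buddy order) covering size, in bytes. */
static size_t order_rounded_bytes(size_t size)
{
	size_t bytes = MODEL_PAGE_SIZE;

	while (bytes < size)
		bytes <<= 1;
	return bytes;
}

/* What alloc_pages_exact() keeps: size rounded up to whole pages only. */
static size_t exact_rounded_bytes(size_t size)
{
	return (size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE * MODEL_PAGE_SIZE;
}
```

For a 6MB area, page_array_bytes() gives 12288, order_rounded_bytes() gives 16384 and exact_rounded_bytes() gives 12288, so trimming the tail saves one 4kB page.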
