From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15
Date: Wed, 25 Aug 2021 11:13:45 -0400
Message-ID: <YSZeKfHxOkEAri1q@cmpxchg.org>
In-Reply-To: <YSVMAS2pQVq+xma7@casper.infradead.org>

On Tue, Aug 24, 2021 at 08:44:01PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 24, 2021 at 02:32:56PM -0400, Johannes Weiner wrote:
> > The folio doc says "It is at least as large as %PAGE_SIZE";
> > folio_order() says "A folio is composed of 2^order pages";
> > page_folio(), folio_pfn(), folio_nr_pages() all encode an N:1
> > relationship. And yes, the name implies it too.
> > 
> > This is in direct conflict with what I'm talking about, where base
> > page granularity could become coarser than file cache granularity.
> 
> That doesn't make any sense.  A page is the fundamental unit of the
> mm.  Why would we want to increase the granularity of page allocation
> and not increase the granularity of the file cache?

I'm not sure why one should be tied to the other. The folio itself is
based on the premise that a cache entry doesn't have to correspond to
exactly one struct page. And I agree with that. I'm just wondering why
it continues to imply a cache entry is at least one full page, rather
than saying a cache entry is a set of bytes that can be backed however
the MM sees fit. So that in case we do bump struct page size in the
future we don't have to redo the filesystem interface again.
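
To illustrate the N:1 relationship the quoted kernel-doc describes, here
is a small userspace sketch. TOY_PAGE_SIZE and the toy_* helpers are
made-up names, not kernel API, and a 4k base page is assumed. The only
point is that a size derived from an order can never drop below one page.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed 4k base page */

/* Hypothetical stand-in for "a folio is composed of 2^order pages". */
static unsigned long toy_nr_pages(unsigned int order)
{
	return 1UL << order;
}

/* Size in bytes; the smallest possible value (order 0) is one full page. */
static size_t toy_size(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}
```

With this shape there is no order that expresses a cache entry smaller
than TOY_PAGE_SIZE, which is the constraint being questioned here.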

I've listed reasons why 4k pages are increasingly the wrong choice for
many allocations, reclaim and paging. We also know there is a need to
maintain support for 4k cache entries.

> > Are we going to bump struct page to 2M soon? I don't know. Here is
> > what I do know about 4k pages, though:
> > 
> > - It's a lot of transactional overhead to manage tens of gigs of
> >   memory in 4k pages. We're reclaiming, paging and swapping more than
> >   ever before in our DCs, because flash provides in abundance the
> >   low-latency IOPS required for that, and parking cold/warm workload
> >   memory on cheap flash saves expensive RAM. But we're continuously
> >   scanning thousands of pages per second to do this. There was also
> >   the RWF_UNCACHED thread around reclaim CPU overhead at the higher
> >   end of buffered IO rates. There is the fact that we have a pending
> >   proposal from Google to replace rmap because it's too CPU-intense
> >   when paging into compressed memory pools.
> 
> This seems like an argument for folios, not against them.  If user
> memory (both anon and file) is being allocated in larger chunks, there
> are fewer pages to scan, less book-keeping to do, and all you're paying
> for that is I/O bandwidth.

Well, it's an argument for huge pages, and we already have those in
the form of THP.

The problem with THP today is that the page allocator fragments the
physical address space at 4k granularity by default, and groups
random allocations together with no type information and only
rudimentary lifetime/reclaimability hints.

I'm having a hard time seeing 2M allocations scale as long as we do
this. As opposed to making 2M the default block and using slab-style
physical grouping by type and instantiation time for smaller cache
entries - to improve the chances of physically contiguous reclaim.

But because folios are compound/head pages first and foremost, they
are inherently tied to being multiples of PAGE_SIZE.

> > - It's a lot of internal fragmentation. Compaction is becoming the
> >   default method for allocating the majority of memory in our
> >   servers. This is a latency concern during page faults, and a
> >   predictability concern when we defer it to khugepaged collapsing.
> 
> Again, the more memory that we allocate in higher-order chunks, the
> better this situation becomes.

It only needs 1 unfortunately placed 4k page out of 512 to mess up a
2M block indefinitely. And the page allocator has little awareness
whether the 4k page it's handing out to somebody pairs well with the
4k page adjacent to it in terms of type and lifetime.
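
The 1-out-of-512 arithmetic can be sketched with a toy model. This is a
userspace illustration only: the used[] array and count_free_blocks()
are hypothetical and have nothing to do with how the page allocator
actually tracks free memory.

```c
#include <stdbool.h>
#include <stddef.h>

#define BASE_PAGE_SIZE	4096UL				/* 4k base page */
#define HUGE_BLOCK_SIZE	(2UL * 1024 * 1024)		/* 2M block */
#define PAGES_PER_BLOCK	(HUGE_BLOCK_SIZE / BASE_PAGE_SIZE)	/* 512 */

/*
 * used[i] is true if 4k page i is allocated. Returns how many aligned
 * 2M blocks consist entirely of free 4k pages.
 */
static size_t count_free_blocks(const bool *used, size_t nr_pages)
{
	size_t free_blocks = 0;

	for (size_t start = 0; start + PAGES_PER_BLOCK <= nr_pages;
	     start += PAGES_PER_BLOCK) {
		bool all_free = true;

		for (size_t i = 0; i < PAGES_PER_BLOCK; i++) {
			if (used[start + i]) {
				all_free = false;
				break;
			}
		}
		if (all_free)
			free_blocks++;
	}
	return free_blocks;
}
```

In this model, one allocated 4k page per block leaves more than 99.8% of
the memory free and not a single 2M block available.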

> > - struct page is statically eating gigs of expensive memory on every
> >   single machine, when only some of our workloads would require this
> >   level of granularity for some of their memory. And that's *after*
> >   we're fighting over every bit in that structure.
> 
> That, folios does not help with.  I have post-folio ideas about how
> to address that, but I can't realistically start working on them
> until folios are upstream.

How would you reduce the memory overhead of struct page without losing
necessary 4k granularity at the cache level? As long as folio implies
that cache entries can't be smaller than a struct page?
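
For a rough sense of the "gigs" in question, a back-of-the-envelope
sketch. It assumes a 64-byte struct page, which is typical on x86-64 but
is not a number given in this thread; memmap_bytes() is a hypothetical
helper.

```c
#include <stdint.h>

/*
 * Bytes of per-page metadata needed to describe ram_bytes of memory.
 * struct_page_size is an assumption supplied by the caller (e.g. 64).
 */
static uint64_t memmap_bytes(uint64_t ram_bytes, uint64_t page_size,
			     uint64_t struct_page_size)
{
	return ram_bytes / page_size * struct_page_size;
}
```

Under that assumption, a 256G machine spends 4G on struct pages at 4k
granularity, and would spend 8M at 2M granularity.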

I appreciate folio is a big patchset and I don't mean to get too much
into speculation about the future.

But we're here in part because the filesystems have been too exposed
to the backing memory implementation details. So all I'm saying is, if
you're touching all the file cache interface now anyway, why not use
the opportunity to properly disconnect it from the reality of pages,
instead of making the compound page the new interface for filesystems.

What's wrong with the idea of a struct cache_entry which can be
embedded wherever we want: in a page, a folio or a pageset. Or in the
future allocated on demand for <PAGE_SIZE entries, if need be. But
actually have it be just a cache entry for the fs to read and write,
not also a compound page and an anon page etc. all at the same time.
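
A minimal sketch of the embedding idea, in userspace C. Every name here
(the fields of struct cache_entry, toy_page, toy_folio, ce_end(),
ce_to_toy_folio()) is hypothetical and only illustrates the shape; none
of it is existing or proposed kernel code.

```c
#include <stddef.h>

/* Hypothetical: all the filesystem would see, a run of cached file bytes. */
struct cache_entry {
	unsigned long long offset;	/* byte offset in the file */
	size_t len;			/* bytes cached; may be < 4096 */
};

/* Toy backing objects that embed the entry; not the kernel's types. */
struct toy_page {
	unsigned long flags;
	struct cache_entry ce;
};

struct toy_folio {
	unsigned int order;
	struct cache_entry ce;
};

/* Userspace version of the container_of() pattern. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Filesystem side: works on any entry, whatever backs it. */
static unsigned long long ce_end(const struct cache_entry *ce)
{
	return ce->offset + ce->len;
}

/* MM side: only code that knows the backing type looks behind the entry. */
static struct toy_folio *ce_to_toy_folio(struct cache_entry *ce)
{
	return toy_container_of(ce, struct toy_folio, ce);
}
```

The filesystem-facing helper never needs to know whether the entry sits
in a toy_page, a toy_folio, or stands alone as a sub-page allocation.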

Even today that would IMO delineate more clearly between the file
cache data plane and the backing memory plane. It doesn't get in the
way of also fixing the base-or-compound mess inside MM code with
folio/pageset, either.

And if down the line we change how the backing memory is implemented,
the changes would be a more manageable scope inside MM proper.

Anyway, I think I've asked all this before and don't mean to harp on
it if people generally disagree that this is a concern.
