linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [GIT PULL] Memory folios for v5.15
Date: Tue, 24 Aug 2021 14:32:56 -0400	[thread overview]
Message-ID: <YSU7WCYAY+ZRy+Ke@cmpxchg.org> (raw)
In-Reply-To: <YSQeFPTMn5WpwyAa@casper.infradead.org>

On Mon, Aug 23, 2021 at 11:15:48PM +0100, Matthew Wilcox wrote:
> On Mon, Aug 23, 2021 at 05:26:41PM -0400, Johannes Weiner wrote:
> > However, this far exceeds the goal of a better mm-fs interface. And
> > the value proposition of a full MM-internal conversion, including
> > e.g. the less exposed anon page handling, is much more nebulous. It's
> > been proposed to leave anon pages out, but IMO to keep that direction
> > maintainable, the folio would have to be translated to a page quite
> > early when entering MM code, rather than propagating it inward, in
> > order to avoid huge, massively overlapping page and folio APIs.
> 
> I only intend to leave anonymous memory out /for now/.  My hope is
> that somebody else decides to work on it (and indeed Google have
> volunteered someone for the task).

Unlike the filesystem side, this seems like a lot of churn for very
little tangible value. And leaves us with an end result that nobody
appears to be terribly excited about.

But the folio abstraction is too low-level to use JUST for file cache
and NOT for anon. It's too close to the page layer itself and would
duplicate too much of it to be maintainable side by side.

That's why I asked why it couldn't be a more abstract memory unit for
managing file cache. With a clearer delineation between that and how
the backing memory is implemented - 1 page, N pages, maybe just a part
of a page later on. And not just be a different name for a head page.

It appears David is asking the same in the parallel subthread.

> > It's also not clear to me that using the same abstraction for compound
> > pages and the file cache object is future proof. It's evident from
> > scalability issues in the allocator, reclaim, compaction, etc. that
> > with current memory sizes and IO devices, we're hitting the limits of
> > efficiently managing memory in 4k base pages per default. It's also
> > clear that we'll continue to have a need for 4k cache granularity for
> > quite a few workloads that work with large numbers of small files. I'm
> > not sure how this could be resolved other than divorcing the idea of a
> > (larger) base page from the idea of cache entries that can correspond,
> > if necessary, to memory chunks smaller than a default page.
> 
> That sounds to me exactly like folios, except for the naming.

Then I think you misunderstood me.

> From the MM point of view, it's less churn to do it your way, but
> from the point of view of the rest of the kernel, there's going to
> be unexpected consequences.  For example, btrfs didn't support page
> size != block size until just recently (and I'm not sure it's
> entirely fixed yet?)
> 
> And there's nobody working on your idea.  At least not that have surfaced
> so far.  The folio patch is here now.
> 
> Folios are also variable sized.  For files which are small, we still only
> allocate 4kB to cache them.  If the file is accessed entirely randomly,
> we only allocate 4kB chunks at a time.  We only allocate larger folios
> when we think there is an advantage to doing so.
> 
> This benefit is retained if someone does come along to change PAGE_SIZE
> to 16KiB (or whatever).  Folios can still be composed of multiple pages,
> no matter what the PAGE_SIZE is.

The folio doc says "It is at least as large as %PAGE_SIZE";
folio_order() says "A folio is composed of 2^order pages";
page_folio(), folio_pfn(), folio_nr_pages all encode a N:1
relationship. And yes, the name implies it too.

This is in direct conflict with what I'm talking about, where base
page granularity could become coarser than file cache granularity.

Are we going to bump struct page to 2M soon? I don't know. Here is
what I do know about 4k pages, though:

- It's a lot of transactional overhead to manage tens of gigs of
  memory in 4k pages. We're reclaiming, paging and swapping more than
  ever before in our DCs, because flash provides in abundance the
  low-latency IOPS required for that, and parking cold/warm workload
  memory on cheap flash saves expensive RAM. But we're continously
  scanning thousands of pages per second to do this. There was also
  the RWF_UNCACHED thread around reclaim CPU overhead at the higher
  end of buffered IO rates. There is the fact that we have a pending
  proposal from Google to replace rmap because it's too CPU-intense
  when paging into compressed memory pools.

- It's a lot of internal fragmentation. Compaction is becoming the
  default method for allocating the majority of memory in our
  servers. This is a latency concern during page faults, and a
  predictability concern when we defer it to khugepaged collapsing.

- struct page is statically eating gigs of expensive memory on every
  single machine, when only some of our workloads would require this
  level of granularity for some of their memory. And that's *after*
  we're fighting over every bit in that structure.

Base page size becoming bigger than cache entries in the near future
doesn't strike me as an exotic idea. The writing seems to be on the
wall. But the folio appears full of assumptions that conflict with it.

Sure, the patch is here now. But how much time will all the churn buy
us before we may need a do-over? Would clean, incremental changes to
the cache entry abstraction even be possible after we have anon and
all kinds of other compound page internals hanging off of it as well?

Wouldn't it make more sense to decouple filesystems from "paginess",
as David puts it, now instead? Avoid the risk of doing it twice, avoid
the more questionable churn inside mm code, avoid the confusing
proximity to the page and its API in the long-term...

  reply	other threads:[~2021-08-24 18:31 UTC|newest]

Thread overview: 162+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-23 19:01 [GIT PULL] Memory folios for v5.15 Matthew Wilcox
2021-08-23 21:26 ` Johannes Weiner
2021-08-23 22:06   ` Linus Torvalds
2021-08-24  2:20     ` Matthew Wilcox
2021-08-24 13:04     ` Matthew Wilcox
2021-08-23 22:15   ` Matthew Wilcox
2021-08-24 18:32     ` Johannes Weiner [this message]
2021-08-24 18:59       ` Linus Torvalds
2021-08-25  6:39         ` Christoph Hellwig
2021-08-24 19:44       ` Matthew Wilcox
2021-08-25 15:13         ` Johannes Weiner
2021-08-26  0:45           ` Darrick J. Wong
2021-08-27 14:07             ` Johannes Weiner
2021-08-27 18:44               ` Matthew Wilcox
2021-08-27 21:41                 ` Dan Williams
2021-08-27 21:49                   ` Matthew Wilcox
2021-08-30 17:32                 ` Johannes Weiner
2021-08-30 18:22                   ` Matthew Wilcox
2021-08-30 20:27                     ` Johannes Weiner
2021-08-30 21:38                       ` Matthew Wilcox
2021-08-31 17:40                         ` Vlastimil Babka
2021-09-01 17:43                         ` Johannes Weiner
2021-09-02 15:13                           ` Zi Yan
2021-09-06 14:00                             ` Vlastimil Babka
2021-08-31 18:50                       ` Eric W. Biederman
2021-08-26  8:58         ` David Howells
2021-08-27 10:03           ` Johannes Weiner
2021-08-27 12:05             ` Matthew Wilcox
2021-08-27 10:49           ` David Howells
2021-08-24 15:54   ` David Howells
2021-08-24 17:56     ` Matthew Wilcox
2021-08-24 18:26       ` Linus Torvalds
2021-08-24 18:29         ` Linus Torvalds
2021-08-24 19:26           ` Theodore Ts'o
2021-08-24 19:34           ` David Howells
2021-08-24 20:02             ` Theodore Ts'o
2021-08-24 21:32             ` David Howells
2021-08-25 12:08               ` Jeff Layton
2021-08-24 19:01         ` Matthew Wilcox
2021-08-24 19:11           ` Linus Torvalds
2021-08-24 19:23             ` Matthew Wilcox
2021-08-24 19:44               ` Theodore Ts'o
2021-08-24 20:00                 ` Matthew Wilcox
2021-08-25  6:32                 ` Christoph Hellwig
2021-08-25  9:01                   ` Rasmus Villemoes
2021-08-26  6:32                     ` Amir Goldstein
2021-08-25 12:03                   ` Jeff Layton
2021-08-26  0:59                     ` Darrick J. Wong
2021-08-26  4:02                   ` Nicholas Piggin
2021-09-01 12:58                 ` Mike Rapoport
2021-08-24 19:35             ` David Howells
2021-08-24 20:35               ` Vlastimil Babka
2021-08-24 20:40                 ` Vlastimil Babka
2021-08-24 19:11         ` David Howells
2021-08-24 19:25           ` Linus Torvalds
2021-08-24 19:38             ` Linus Torvalds
2021-08-24 19:48               ` Linus Torvalds
2021-08-26 17:18                 ` Matthew Wilcox
2021-08-24 19:59             ` David Howells
2021-10-05 13:52   ` Matthew Wilcox
2021-10-05 17:29     ` Johannes Weiner
2021-10-05 17:32       ` David Hildenbrand
2021-10-05 18:30       ` Matthew Wilcox
2021-10-05 19:56         ` Jason Gunthorpe
2021-08-28  3:29 ` Matthew Wilcox
2021-09-09 12:43 ` Christoph Hellwig
2021-09-09 13:56   ` Vlastimil Babka
2021-09-09 18:16     ` Johannes Weiner
2021-09-09 18:44       ` Matthew Wilcox
2021-09-09 22:03         ` Johannes Weiner
2021-09-09 22:48           ` Matthew Wilcox
2021-09-09 19:17     ` John Hubbard
2021-09-09 19:23       ` Matthew Wilcox
2021-09-10 20:16 ` Folio discussion recap Kent Overstreet
2021-09-11  1:23   ` Kirill A. Shutemov
2021-09-13 11:32     ` Michal Hocko
2021-09-13 18:12       ` Johannes Weiner
2021-09-15 15:40   ` Johannes Weiner
2021-09-15 17:55     ` Damian Tometzki
2021-09-16  2:58     ` Darrick J. Wong
2021-09-16 16:54       ` Johannes Weiner
2021-09-17  5:24         ` Dave Chinner
2021-09-17  7:18           ` Christoph Hellwig
2021-09-17 16:31           ` Johannes Weiner
2021-09-17 20:57             ` Kirill A. Shutemov
2021-09-17 21:17               ` Kent Overstreet
2021-09-17 22:02                 ` Kirill A. Shutemov
2021-09-17 22:21                   ` Kent Overstreet
2021-09-17 23:15               ` Johannes Weiner
2021-09-20 10:03                 ` Kirill A. Shutemov
2021-09-17 21:13             ` Kent Overstreet
2021-09-17 22:25               ` Theodore Ts'o
2021-09-17 23:35                 ` Josef Bacik
2021-09-18  1:04             ` Dave Chinner
2021-09-18  4:51               ` Kent Overstreet
2021-09-20  1:04                 ` Dave Chinner
2021-09-16 21:58       ` David Howells
2021-09-20  2:17   ` Matthew Wilcox
2021-09-21 19:47     ` Johannes Weiner
2021-09-21 20:38       ` Matthew Wilcox
2021-09-21 21:11         ` Kent Overstreet
2021-09-21 21:22           ` Folios for 5.15 request - Was: re: Folio discussion recap - Kent Overstreet
2021-09-22 15:08             ` Johannes Weiner
2021-09-22 15:46               ` Kent Overstreet
2021-09-22 16:26                 ` Matthew Wilcox
2021-09-22 16:56                   ` Chris Mason
2021-09-22 19:54                     ` Matthew Wilcox
2021-09-22 20:15                       ` Kent Overstreet
2021-09-22 20:21                       ` Linus Torvalds
2021-09-23  5:42               ` Kent Overstreet
2021-09-23 18:00                 ` Johannes Weiner
2021-09-23 19:31                   ` Matthew Wilcox
2021-09-23 20:20                   ` Kent Overstreet
2021-10-16  3:28               ` Matthew Wilcox
2021-10-18 16:47                 ` Johannes Weiner
2021-10-18 18:12                   ` Kent Overstreet
2021-10-18 20:45                     ` Johannes Weiner
2021-10-19 16:11                       ` Splitting struct page into multiple types " Kent Overstreet
2021-10-19 17:06                         ` Gao Xiang
2021-10-19 17:34                           ` Matthew Wilcox
2021-10-19 17:54                             ` Gao Xiang
2021-10-20 17:46                               ` Kent Overstreet
2021-10-19 17:37                         ` Jason Gunthorpe
2021-10-19 21:14                       ` David Howells
2021-10-18 18:28                   ` Folios for 5.15 request " Matthew Wilcox
2021-10-18 21:56                     ` Johannes Weiner
2021-10-18 23:16                       ` Kirill A. Shutemov
2021-10-19 15:16                         ` Johannes Weiner
2021-10-20  3:19                           ` Matthew Wilcox
2021-10-20  7:50                           ` David Hildenbrand
2021-10-20 17:26                             ` Matthew Wilcox
2021-10-20 18:04                               ` David Hildenbrand
2021-10-21  6:51                                 ` Christoph Hellwig
2021-10-21  7:21                                   ` David Hildenbrand
2021-10-21 12:03                                     ` Kent Overstreet
2021-10-21 12:35                                       ` David Hildenbrand
2021-10-21 12:38                                         ` Christoph Hellwig
2021-10-21 13:00                                           ` David Hildenbrand
2021-10-21 12:41                                         ` Matthew Wilcox
2021-10-20 17:39                           ` Kent Overstreet
2021-10-21 21:37                             ` Johannes Weiner
2021-10-22  1:52                               ` Matthew Wilcox
2021-10-22  7:59                                 ` David Hildenbrand
2021-10-22 13:01                                   ` Matthew Wilcox
2021-10-22 14:40                                     ` David Hildenbrand
2021-10-23  2:22                                       ` Matthew Wilcox
2021-10-23  5:02                                         ` Christoph Hellwig
2021-10-23  9:58                                         ` David Hildenbrand
2021-10-23 16:00                                           ` Kent Overstreet
2021-10-23 21:41                                             ` Matthew Wilcox
2021-10-23 22:23                                               ` Kent Overstreet
2021-10-25 15:35                                 ` Johannes Weiner
2021-10-25 15:52                                   ` Matthew Wilcox
2021-10-25 16:05                                   ` Kent Overstreet
2021-10-16 19:07               ` Matthew Wilcox
2021-10-18 17:25                 ` Johannes Weiner
2021-09-21 22:18           ` Folio discussion recap Matthew Wilcox
2021-09-23  0:45             ` Ira Weiny
2021-09-23  3:41               ` Matthew Wilcox
2021-09-23 22:12                 ` Ira Weiny
2021-09-29 15:24                   ` Matthew Wilcox
2021-09-21 21:59         ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YSU7WCYAY+ZRy+Ke@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).