linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-cachefs@redhat.com,
	linux-afs@lists.infradead.org
Subject: Re: [PATCH v5 00/27] Memory Folios
Date: Mon, 22 Mar 2021 18:47:44 +0000	[thread overview]
Message-ID: <20210322184744.GU1719932@casper.infradead.org> (raw)
In-Reply-To: <YFja/LRC1NI6quL6@cmpxchg.org>

On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> On Sat, Mar 20, 2021 at 05:40:37AM +0000, Matthew Wilcox (Oracle) wrote:
> > This series introduces the 'struct folio' as a replacement for
> > head-or-base pages.  This initial set reduces the kernel size by
> > approximately 6kB, although its real purpose is adding infrastructure
> > to enable further use of the folio.
> > 
> > The intent is to convert all filesystems and some device drivers to work
> > in terms of folios.  This series contains a lot of explicit conversions,
> > but it's important to realise it's removing a lot of implicit conversions
> > in some relatively hot paths.  There will be very few conversions from
> > folios when this work is completed; filesystems, the page cache, the
> > LRU and so on will generally only deal with folios.
> 
> If that is the case, shouldn't there in the long term only be very
> few, easy to review instances of things like compound_head(),
> PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> never see tail pages and 2) never assume a compile-time page size?

I don't know exactly where we get to eventually.  There are definitely
some aspects of the filesystem<->mm interface which are page-based
(eg ->fault needs to look up the exact page, regardless of its
head/tail/base nature), while ->readpage needs to talk in terms of
folios.

> What are the higher-level places that in the long-term should be
> dealing with tail pages at all? Are there legit ones besides the page
> allocator, THP splitting internals & pte-mapped compound pages?

I can't tell.  I think this patch maybe illustrates some of the
problems, but maybe it's just an intermediate problem:

https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b

It deals mostly in terms of folios, but when it needs to kmap() and
memcmp(), then it needs to work in terms of pages.  I don't think it's
avoidable (maybe we bury the "dealing with pages" inside a kmap()
wrapper somewhere, but I'm not sure that's better).

> I do agree that the current confusion around which layer sees which
> types of pages is a problem. But I also think a lot of it is the
> result of us being in a transitional period where we've added THP in
> more places but not all code and data structures are or were fully
> native yet, and so we had things leak out or into where maybe they
> shouldn't be to make things work in the short term.
> 
> But this part is already getting better, and has gotten better, with
> the page cache (largely?) going native for example.

Thanks ;-)  There's still more work to do on that (ie storing one
entry to cover 512 indices instead of 512 identical entries), but it
is getting better.  What can't be made better is the CPU page tables;
they really do need to point to tail pages.

One of my longer-term goals is to support largeish pages on ARM (and
other CPUs).  Instead of these silly config options to have 16KiB
or 64KiB pages, support "add PTEs for these 16 consecutive, aligned pages".
And I'm not sure how we do that without folios.  The notion that a
page is PAGE_SIZE is really, really ingrained.  I tried the page_size()
macro to make things easier, but there's 17000 instances of PAGE_SIZE
in the tree, and they just aren't going to go away.

> Some compound_head() that are currently in the codebase are already
> unnecessary. Like the one in activate_page().

Right!  And it's hard to find & remove them without very careful analysis,
or particularly deep knowledge.  With folios, we can remove them without
terribly deep thought.

> And looking at grep, I wouldn't be surprised if only the page table
> walkers need the page_compound() that mark_page_accessed() does. We
> would be better off if they did the translation once and explicitly in
> the outer scope, where it's clear they're dealing with a pte-mapped
> compound page, instead of having a series of rather low level helpers
> (page flags testing, refcount operations, LRU operations, stat
> accounting) all trying to be clever but really just obscuring things
> and imposing unnecessary costs on the vast majority of cases.
> 
> So I fully agree with the motivation behind this patch. But I do
> wonder why it's special-casing the commmon case instead of the rare
> case. It comes at a huge cost. Short term, the churn of replacing
> 'page' with 'folio' in pretty much all instances is enormous.

Because people (think they) know what a page is.  It's PAGE_SIZE bytes
long, it occupies one PTE, etc, etc.  A folio is new and instead of
changing how something familiar (a page) behaves, we're asking them
to think about something new instead that behaves a lot like a page,
but has differences.

> And longer term, I'm not convinced folio is the abstraction we want
> throughout the kernel. If nobody should be dealing with tail pages in
> the first place, why are we making everybody think in 'folios'? Why
> does a filesystem care that huge pages are composed of multiple base
> pages internally? This feels like an implementation detail leaking out
> of the MM code. The vast majority of places should be thinking 'page'
> with a size of 'page_size()'. Including most parts of the MM itself.

I think pages already leaked out of the MM and into filesystems (and
most of the filesystem writers seem pretty unknowledgable about how
pages and the page cache work, TBH).  That's OK!  Or it should be OK.
Filesystem authors should be experts on how their filesystem works.
Everywhere that they have to learn about the page cache is a distraction
and annoyance for them.

I mean, I already tried what you're suggesting.  It's really freaking
hard.  It's hard to do, it's hard to explain, it's hard to know if you
got it right.  With folios, I've got the compiler working for me, telling
me that I got some of the low-level bits right (or wrong), leaving me
free to notice "Oh, wait, we got the accounting wrong because writeback
assumes that a page is only PAGE_SIZE bytes".  I would _never_ have
noticed that with the THP tree.  I only noticed it because transitioning
things to folios made me read the writeback code and wonder about the
'inc_wb_stat' call, see that it's measuring something in 'number of pages'
and realise that the wb_stat accounting needs to be fixed.

> The compile-time check is nice, but I'm not sure it would be that much
> more effective at catching things than a few centrally placed warns
> inside PageFoo(), get_page() etc. and other things that should not
> encounter tail pages in the first place (with __helpers for the few
> instances that do). And given the invasiveness of this change, they
> ought to be very drastically better at it, and obviously so, IMO.

We should have come up with a new type 15 years ago instead of doing THP.
But the second best time to invent a new type for "memory objects which
are at least as big as a page" is right now.  Because it only gets more
painful over time.

  reply	other threads:[~2021-03-22 18:49 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-20  5:40 [PATCH v5 00/27] Memory Folios Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 01/27] fs/cachefiles: Remove wait_bit_key layout dependency Matthew Wilcox (Oracle)
2021-03-22  8:06   ` Christoph Hellwig
2021-03-20  5:40 ` [PATCH v5 02/27] mm/writeback: Add wait_on_page_writeback_killable Matthew Wilcox (Oracle)
2021-03-22  8:07   ` Christoph Hellwig
2021-03-20  5:40 ` [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable Matthew Wilcox (Oracle)
2021-03-22  8:08   ` Christoph Hellwig
2021-03-20  5:40 ` [PATCH v5 04/27] mm: Introduce struct folio Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 05/27] mm: Add folio_pgdat and folio_zone Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 06/27] mm/vmstat: Add functions to account folio statistics Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 07/27] mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 08/27] mm: Add put_folio Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 09/27] mm: Add get_folio Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 10/27] mm: Create FolioFlags Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 11/27] mm: Handle per-folio private data Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 12/27] mm: Add folio_index, folio_file_page and folio_contains Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 13/27] mm/util: Add folio_mapping and folio_file_mapping Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 14/27] mm/memcg: Add folio wrappers for various functions Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 15/27] mm/filemap: Add unlock_folio Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 16/27] mm/filemap: Add lock_folio Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 17/27] mm/filemap: Add lock_folio_killable Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 18/27] mm/filemap: Add __lock_folio_async Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 19/27] mm/filemap: Add __lock_folio_or_retry Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 20/27] mm/filemap: Add wait_on_folio_locked Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 21/27] mm/filemap: Add end_folio_writeback Matthew Wilcox (Oracle)
2021-03-20  5:40 ` [PATCH v5 22/27] mm/writeback: Add wait_on_folio_writeback Matthew Wilcox (Oracle)
2021-03-20  5:41 ` [PATCH v5 23/27] mm/writeback: Add wait_for_stable_folio Matthew Wilcox (Oracle)
2021-03-20  5:41 ` [PATCH v5 24/27] mm/filemap: Convert wait_on_page_bit to wait_on_folio_bit Matthew Wilcox (Oracle)
2021-03-21  7:10   ` kernel test robot
2021-03-20  5:41 ` [PATCH v5 25/27] mm/filemap: Convert wake_up_page_bit to wake_up_folio_bit Matthew Wilcox (Oracle)
2021-03-20  5:41 ` [PATCH v5 26/27] mm/filemap: Convert page wait queues to be folios Matthew Wilcox (Oracle)
2021-03-20  7:54   ` kernel test robot
2021-03-20  5:41 ` [PATCH v5 27/27] mm/doc: Build kerneldoc for various mm files Matthew Wilcox (Oracle)
2021-03-22  3:25 ` [PATCH v5 00/27] Memory Folios Matthew Wilcox
2021-03-22  9:25 ` [PATCH v5 01/27] fs/cachefiles: Remove wait_bit_key layout dependency David Howells
2021-03-22  9:26 ` [PATCH v5 02/27] mm/writeback: Add wait_on_page_writeback_killable David Howells
2021-03-22  9:27 ` [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable David Howells
2021-03-22 19:41   ` Matthew Wilcox
2021-03-22 17:59 ` [PATCH v5 00/27] Memory Folios Johannes Weiner
2021-03-22 18:47   ` Matthew Wilcox [this message]
2021-03-24  0:29     ` Johannes Weiner
2021-03-24  6:24       ` Matthew Wilcox
2021-03-26 17:48         ` Johannes Weiner
2021-03-29 16:58           ` Matthew Wilcox
2021-03-29 17:56             ` Matthew Wilcox
2021-03-30 19:30             ` Johannes Weiner
2021-03-30 21:09               ` Matthew Wilcox
2021-03-31 18:14                 ` Johannes Weiner
2021-03-31 18:28                   ` Matthew Wilcox
2021-04-01  5:05                 ` Al Viro
2021-04-01 12:07                   ` Matthew Wilcox
2021-04-01 16:00                   ` Johannes Weiner
2021-03-31 14:54               ` Christoph Hellwig
2021-03-23 15:50   ` Christoph Hellwig
2021-03-23 11:29 ` [PATCH v5 03/27] afs: Use wait_on_page_writeback_killable David Howells
2021-03-23 17:50 ` [PATCH v5 00/27] Memory Folios David Howells

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210322184744.GU1719932@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-afs@lists.infradead.org \
    --cc=linux-cachefs@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).