From: Matthew Wilcox <willy@infradead.org>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 18/37] ext4: Convert ext4 to read_folio
Date: Mon, 9 May 2022 22:07:05 +0100
Message-ID: <YnmCeRJe5w2bVF1u@casper.infradead.org>
In-Reply-To: <Ynl2k7FNEBB0awNs@mit.edu>

On Mon, May 09, 2022 at 04:16:19PM -0400, Theodore Ts'o wrote:
> > but the page
> > cache is absolutely not supposed to be creating large folios for
> > filesystems that haven't indicated their support for such by calling
> > mapping_set_large_folios().
> 
> I think my concern is that at some point in the future, ext4 probably
> *will* want to enable large folios --- and we may want to do so
> selectively.  e.g., just on the read-path, and assume that someone
> will break apart large folios to individual pages on the write path,
> for example.

My mental model doesn't include that last part -- if the filesystem says
it supports large folios, then it supports large folios, including read,
write-into, writeback, etc.  The filesystem can (of course) split the
page itself, but page splitting can fail, and part of the point of all
this is to keep the memory allocation as a single object throughout its
allocation-usage-free lifecycle.

> The question is when do we add all of these sanity check asserts ---
> at the point when ext4 starts making the transition from large folio
> unaware, to large folio kind-of-aware, and hope we don't miss any of
> these interfaces?  Or add those sanity check asserts now, so we get
> reminded that some of these functions may need fixing up when we start
> adding large folio support to the file system?

Right!  I went through the same questions when starting to work on iomap
-- how do we know which of these functions is really capable of dealing
with large folios?  The good news is that xfstests blows up in very
entertaining and obvious ways, so most of this you really can do by
trial and error.  That's obviously a bit unsatisfying ...

As I convert utility functions to take a folio, I try not to leave
landmines; I either convert the function to work on a folio of arbitrary
size, or I leave a VM_BUG_ON_FOLIO in there to help the first person
who tries to use it with a large folio.  Of course, there can always
be places I missed.  If there's anywhere still using a struct page,
that's always cause for greater scrutiny.

> Also, what's the intent for when the MM layer would call
> aops->read_folio() with the intent to fill a huge folio, versus
> calling aops->readahead()?  After all, when we take a page fault,
> it'll be a 4k page, right?  We currently don't support
> file-backed huge pages; is there a plan to change this?

One of the recent changes to the pagecache (from Jens, iirc) is that
we _always_ call ->readahead() before calling ->read_folio().  That's a
really good thing because it means you can now make your ->read_folio()
synchronous and return the exact error instead of sending off an I/O
and signalling the page "had an error" and returning -EIO to userspace.

The decision about whether to use THP for file-backed faults is
currently left up to userspace.  Commit 4687fdbb805a added support for
the existing MADV_HUGEPAGE to file-backed pages.  I'm not inclined to
leave the decision entirely up to userspace ... I think we should notice
that userspace is behaving in a way that using larger folios would be
beneficial, and eventually end up going all the way to THPs.  However,
the vast majority of applications do not use mmap() on files, so I'm
not too enthusiastic about adding that support until someone comes to
me with a use case.
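
For reference, the userspace side of that opt-in looks roughly like the
sketch below.  It is a minimal example, not taken from any particular
application: the helper name map_file_with_thp_hint() is made up, and
whether the kernel honours the hint for a given file depends on kernel
configuration and filesystem support, so the madvise() result is
reported to the caller rather than treated as fatal.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file read-only and hint that the range
 * would benefit from huge pages.  Returns the mapping (the caller
 * munmap()s it) or MAP_FAILED.  On success, *advice_errno is 0 if the
 * hint was accepted, otherwise the errno from madvise().
 */
static void *map_file_with_thp_hint(const char *path, size_t *len_out,
                                    int *advice_errno)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return MAP_FAILED;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return MAP_FAILED;
    }

    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

    close(fd);          /* the mapping stays valid after close() */
    if (addr == MAP_FAILED)
        return MAP_FAILED;

    *advice_errno = 0;
#ifdef MADV_HUGEPAGE
    /* A hint only: the kernel may reject or ignore it. */
    if (madvise(addr, len, MADV_HUGEPAGE) != 0)
        *advice_errno = errno;
#else
    *advice_errno = ENOSYS;   /* headers do not expose MADV_HUGEPAGE */
#endif
    *len_out = len;
    return addr;
}
```

A caller would pass a path and check *advice_errno afterwards; a
non-zero value means the mapping still works, just without the hint.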

I did just notice that I need to update the manpage for madvise(2).

> P.S.  On a somewhat unrelated issue, if we have a really large folio
> caused by a 4MB readahead because CIFS really wanted a huge readahead
> size because of the network setup overhead --- and then a single 4k
> page gets dirtied, I imagine the VM subsystem would *want* to break apart
> that 4MB folio so that we know that only that single 4k page was
> dirtied, and not require writing back a huge amount of clean 4k pages
> just because we didn't track dirtiness at the right granularity,
> right?

Well ... we don't necessarily grow the folio size all the way up to
the size of the readahead window.  The two are decoupled to a certain
extent (obviously the folio size won't exceed the readahead size!).  And I
currently limit the readahead folio size to PMD_SIZE because I don't want
to track down all the places that assume a PMD page is exactly PMD_SIZE.
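
As a rough illustration of those two bounds only (this is not how the
readahead code actually picks an order, which ramps up gradually), the
upper limit can be sketched in userspace C.  The function name and the
MODEL_* constants are made up for the example; the shifts assume 4KB
base pages and a 2MB PMD, as on x86-64.

```c
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12u   /* assumption: 4KB base pages */
#define MODEL_PMD_SHIFT  21u   /* assumption: 2MB PMD (x86-64, 4KB pages) */

/*
 * Largest folio order whose size does not exceed the readahead window,
 * capped at the PMD order.  Order n means 2^n base pages.
 */
static unsigned int max_readahead_folio_order(size_t ra_bytes)
{
    unsigned int pmd_order = MODEL_PMD_SHIFT - MODEL_PAGE_SHIFT;
    unsigned int order = 0;
    size_t pages = ra_bytes >> MODEL_PAGE_SHIFT;

    /* Grow while a folio of the next order still fits in the window. */
    while (order < pmd_order && (pages >> (order + 1)) != 0)
        order++;
    return order;
}
```

With these assumptions a 128KB window allows at most an order-5 folio,
and a 4MB window is capped at order 9 (2MB) rather than order 10.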

But no, the VM doesn't want to break up a 4MB folio just because we only
dirtied 4KB of it.  We do tell the FS which parts of it are dirtied, so if
the FS wants to keep track of what is dirty at a sub-folio level, it can.
But it doesn't have to; choosing the granularity of dirtiness is a job
for the filesystem, not the VM.  It can be beneficial to do a larger
write than what is dirty, while some filesystems want to be byte-precise
in what they write back.
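
To make the sub-folio option concrete, here is a minimal userspace model
of block-granular dirty tracking inside one large folio.  It is a sketch
under stated assumptions, not any filesystem's implementation: the
struct, the function names and the DIRTY_* constants are all
hypothetical, with a 4MB folio tracked in 4KB blocks to match the
example being discussed.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DIRTY_BLOCK_SIZE 4096u                 /* assumed granularity */
#define DIRTY_FOLIO_SIZE (4u * 1024u * 1024u)  /* the 4MB folio above */
#define DIRTY_NR_BLOCKS  (DIRTY_FOLIO_SIZE / DIRTY_BLOCK_SIZE)
#define DIRTY_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define DIRTY_NR_WORDS \
    ((DIRTY_NR_BLOCKS + DIRTY_WORD_BITS - 1) / DIRTY_WORD_BITS)

/* One bit per block; a set bit means the block needs writeback. */
struct folio_dirty_map {
    unsigned long bits[DIRTY_NR_WORDS];
};

static void dirty_map_init(struct folio_dirty_map *m)
{
    memset(m, 0, sizeof(*m));
}

/* Mark every block touched by the byte range [offset, offset + len). */
static void dirty_map_mark(struct folio_dirty_map *m, size_t offset,
                           size_t len)
{
    if (len == 0 || offset >= DIRTY_FOLIO_SIZE)
        return;
    if (len > DIRTY_FOLIO_SIZE - offset)
        len = DIRTY_FOLIO_SIZE - offset;   /* clamp to the folio */

    size_t first = offset / DIRTY_BLOCK_SIZE;
    size_t last = (offset + len - 1) / DIRTY_BLOCK_SIZE;

    for (size_t b = first; b <= last; b++)
        m->bits[b / DIRTY_WORD_BITS] |= 1UL << (b % DIRTY_WORD_BITS);
}

/* The caller must pass block < DIRTY_NR_BLOCKS. */
static bool dirty_map_test(const struct folio_dirty_map *m, size_t block)
{
    return (m->bits[block / DIRTY_WORD_BITS] >>
            (block % DIRTY_WORD_BITS)) & 1UL;
}

/* Bytes that block-granular writeback would have to write. */
static size_t dirty_map_bytes(const struct folio_dirty_map *m)
{
    size_t n = 0;

    for (size_t b = 0; b < DIRTY_NR_BLOCKS; b++)
        if (dirty_map_test(m, b))
            n++;
    return n * DIRTY_BLOCK_SIZE;
}
```

Dirtying 100 bytes at offset 8192 marks a single block, so this model
would write back 4KB instead of 4MB.  A filesystem that prefers one big
write can skip the map and treat the whole folio as dirty.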

If an application has exhibited good locality in reads (enough to get
readahead growing to 4MB), the hope is that it will also exhibit good
locality in writes, so absorbing a lot of writes to the same 4MB page
before it gets written back in a big chunk will actually be a good thing.

Obviously, that's something we'll only know is true once users start
banging on this feature in earnest.  We may need to adjust how we handle
large folios to make users happier.  I don't think we can reasonably
anticipate the problems they'll see.
