From: Dave Chinner <david@fromorbit.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Pankaj Raghav <p.raghav@samsung.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>,
	Luis Chamberlain <mcgrof@kernel.org>,
	gost.dev@samsung.com
Subject: Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
Date: Tue, 20 Jun 2023 08:57:48 +1000	[thread overview]
Message-ID: <ZJDdbPwfXI6eR5vB@dread.disaster.area> (raw)
In-Reply-To: <b6d982ce-3e7e-e433-8339-28ec8474df03@suse.de>

On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
> On 6/19/23 10:08, Pankaj Raghav wrote:
> > Hi Hannes,
> > On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
> > > The mapping has an underlying blocksize (by virtue of
> > > mapping->host->i_blkbits), so if the mapping blocksize
> > > is larger than the pagesize we should allocate folios
> > > in the correct order.
> > > 
> > Network filesystems such as 9pfs set blkbits to the maximum amount
> > of data they want to transfer, leading to unnecessary memory
> > pressure, as we will try to allocate higher-order folios (order 5
> > in my setup). Isn't it better for each filesystem to request the
> > minimum folio order it needs for its page cache early on? Block
> > devices can do the same for their block cache.

Folio size is not a "filesystem-wide" thing - it's a per-inode
configuration. We can have inodes within a filesystem that have
different "block" sizes. A prime example of this is XFS directories
- they can have 64kB block sizes on a 4kB block size filesystem.
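
For example, the directory block size is set at mkfs time (-b and -n
are real mkfs.xfs options; the device name is a placeholder):

  # 4kB filesystem blocks, 64kB directory blocks
  mkfs.xfs -b size=4096 -n size=65536 /dev/sdX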

Another example is extent size hints in XFS data files - they
trigger aligned allocation-around, similar to using large folios in
the page cache for small writes. Effectively this gives data files a
"block size" of the extent size hint regardless of the filesystem
block size.
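
For example (the extsize command is real xfs_io; the path is a
placeholder, and the hint generally has to be set while the file
still has no data extents):

  # set a 64kB extent size hint
  xfs_io -c "extsize 64k" /path/to/file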

Hence, in future, we might want different folio sizes for different
types of inodes, so whatever we do needs to support per-inode folio
size configuration for the inode mapping tree.

> > I have a prototype along those lines and I will post it soon. This
> > is also something willy indicated before in a mailing list
> > conversation.
> > 
> Well; I _thought_ that's why we had things like optimal I/O size and
> maximal I/O size. But these seem to be relegated to request queue
> limits, so I guess they're not available from 'struct block_device'
> or 'struct gendisk'.

Yes, those are block device constructs to enable block device based
filesystems to be laid out best for the given block device. They
don't exist for non-block-based filesystems like network
filesystems...

> So I've been thinking of adding a flag somewhere (possibly in
> 'struct address_space') to indicate that blkbits is a hard limit
> and not just an advisory thing.

This still relies on interpreting inode->i_blkbits repeatedly at
runtime in some way, in mm code that really has no business looking
at filesystem block sizes.

What is needed is a field in the mapping that defines the folio
order that all folios allocated for the page cache must be
aligned/sized to in order to be inserted into the mapping.

This means the minimum folio order and alignment are maintained
entirely by the mapping (e.g. it allows truncate to do the right
thing), and the filesystem/device side code does not need to do
anything special (except support large folios) to ensure that the
page cache always contains folios that are block sized and aligned.
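
As a rough sketch of the allocation side - mapping_min_folio_order()
is a name I'm inventing here (defined in the sketch further down),
while filemap_alloc_folio() is the existing allocation helper:

static struct folio *
filemap_alloc_mapping_folio(struct address_space *mapping, gfp_t gfp,
			    unsigned int order)
{
	/* Never allocate below the mapping's minimum folio order. */
	order = max(order, mapping_min_folio_order(mapping));
	return filemap_alloc_folio(gfp, order);
}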

We already have mapping_set_large_folios() that we use at
inode/mapping instantiation time to enable large folios in the page
cache for that mapping. What we need is a new
mapping_set_large_folio_order() API to enable the filesystem/device
to set the base folio order for the mapping tree at instantiation
time, and for all the page cache instantiation code to align/size to
the order stored in the mapping...
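
Purely as a sketch of what that could look like -
mapping_set_large_folios() and AS_LARGE_FOLIO_SUPPORT exist today,
everything else here (names, bit positions) is invented for
illustration:

/* Assume bits 16..20 of mapping->flags are free to hold the order. */
#define AS_FOLIO_ORDER_SHIFT	16
#define AS_FOLIO_ORDER_MASK	(0x1fUL << AS_FOLIO_ORDER_SHIFT)

/*
 * Only safe at inode/mapping instantiation time, before the mapping
 * is visible to anyone else - the same rule that applies to
 * mapping_set_large_folios() today.
 */
static inline void
mapping_set_large_folio_order(struct address_space *mapping,
			      unsigned int order)
{
	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
	mapping->flags &= ~AS_FOLIO_ORDER_MASK;
	mapping->flags |= (unsigned long)order << AS_FOLIO_ORDER_SHIFT;
}

static inline unsigned int
mapping_min_folio_order(struct address_space *mapping)
{
	return (mapping->flags & AS_FOLIO_ORDER_MASK) >>
			AS_FOLIO_ORDER_SHIFT;
}

Truncate, readahead and the write path would then align folio indices
and sizes to that order instead of ever looking at i_blkbits.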

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 40+ messages
2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
2023-06-14 12:45   ` Matthew Wilcox
2023-06-14 12:50     ` Pankaj Raghav
2023-06-14 13:03       ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 2/7] brd: convert to folios Hannes Reinecke
2023-06-14 13:45   ` Matthew Wilcox
2023-06-14 13:50     ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 3/7] brd: abstract page_size conventions Hannes Reinecke
2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
2023-06-14 12:55   ` Matthew Wilcox
2023-06-14 13:02     ` Hannes Reinecke
2023-06-15  2:17   ` Dave Chinner
2023-06-15  5:55     ` Christoph Hellwig
2023-06-15  6:33       ` Hannes Reinecke
2023-06-15  6:23     ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 5/7] brd: make logical " Hannes Reinecke
2023-06-14 11:46 ` [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize Hannes Reinecke
2023-06-19  8:08     ` Pankaj Raghav
2023-06-19  8:42       ` Hannes Reinecke
2023-06-19 22:57         ` Dave Chinner [this message]
2023-06-20  0:00           ` Matthew Wilcox
2023-06-20  5:57           ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 7/7] mm/readahead: align readahead down to " Hannes Reinecke
2023-06-14 13:17 ` [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
2023-06-14 13:53   ` Matthew Wilcox
2023-06-14 15:06     ` Hannes Reinecke
2023-06-14 15:35       ` Hannes Reinecke
2023-06-14 17:46         ` Matthew Wilcox
2023-06-14 23:53       ` Dave Chinner
2023-06-15  6:21         ` Hannes Reinecke
2023-06-15  8:51           ` Dave Chinner
2023-06-16 16:06             ` Kent Overstreet
2023-06-15  3:44       ` Dave Chinner
2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
2023-06-14 18:38   ` kernel test robot
2023-06-14 19:30   ` kernel test robot
2023-06-15  5:58   ` Christoph Hellwig
2023-06-15 12:16     ` Matthew Wilcox
2023-06-14 13:48 ` [PATCH 2/2] highmem: Add memcpy_from_folio() Matthew Wilcox (Oracle)
