* [LSF/MM/BPF TOPIC] Memory folios
From: Matthew Wilcox @ 2021-05-10 17:56 UTC
  To: lsf-pc, linux-mm, linux-fsdevel

I don't know exactly how much will be left to discuss about supporting
larger memory allocation units in the page cache by December.  In my
ideal world, all the patches I've submitted so far are accepted, I
persuade every filesystem maintainer to convert their own filesystem,
and struct page is nothing but a bad memory by December.  In reality,
I'm just not that persuasive.

So, some kind of discussion will probably be worthwhile about
converting the remaining filesystems to use folios, when it's worth
having filesystems opt in to multi-page folios, what we can do about
buffer-head-based filesystems, and so on.

Hopefully we aren't still discussing whether folios are a good idea
or not by then.




* Re: [LSF/MM/BPF TOPIC] Memory folios
From: Matthew Wilcox @ 2021-05-14 17:48 UTC
  To: lsf-pc, linux-mm, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Mon, May 10, 2021 at 06:56:17PM +0100, Matthew Wilcox wrote:
> I don't know exactly how much will be left to discuss about supporting
> larger memory allocation units in the page cache by December.  In my
> ideal world, all the patches I've submitted so far are accepted, I
> persuade every filesystem maintainer to convert their own filesystem,
> and struct page is nothing but a bad memory by December.  In reality,
> I'm just not that persuasive.
> 
> So, some kind of discussion will probably be worthwhile about
> converting the remaining filesystems to use folios, when it's worth
> having filesystems opt in to multi-page folios, what we can do about
> buffer-head-based filesystems, and so on.
> 
> Hopefully we aren't still discussing whether folios are a good idea
> or not by then.

I got an email from Hannes today asking about memory folios as they
pertain to the block layer, and I thought this would be a good chance
to talk about them.  If you're not familiar with the term "folio",
https://lore.kernel.org/lkml/20210505150628.111735-10-willy@infradead.org/
is not a bad introduction.

Thanks to the work done by Ming Lei in 2017, the block layer already
supports multipage bvecs, so to a first order of approximation, I don't
need anything from the block layer on down through the various storage
layers.  Which is why I haven't been talking to anyone in storage!

It might change (slightly) the contents of bios.  For example,
bvec[n]->bv_offset might now be larger than PAGE_SIZE.  Drivers should
handle this OK, but probably haven't been audited to make sure they do.
Mostly, it's simply that drivers will now see fewer, larger, segments
in their bios.  Once a filesystem supports multipage folios, we will
allocate order-N pages as part of readahead (and sufficiently large
writes).  Dirtiness is tracked on a per-folio basis (not per page),
so folios take trips around the LRU as a single unit and finally make
it to being written back as a single unit.
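
To put that in concrete terms, here's a rough, untested sketch of the
driver side; the function is made up, but the iterators and fields are
the existing block layer ones:

#include <linux/bio.h>
#include <linux/io.h>
#include <linux/printk.h>

/* Hypothetical driver helper: walk a bio one (possibly multipage) bvec
 * at a time.  With bio_for_each_bvec(), bv_offset + bv_len may exceed
 * PAGE_SIZE, but the memory within a single bvec is physically
 * contiguous, so the address arithmetic is unchanged.
 */
static void sketch_map_bio(struct bio *bio)
{
        struct bio_vec bvec;
        struct bvec_iter iter;

        bio_for_each_bvec(bvec, bio, iter) {
                phys_addr_t phys = page_to_phys(bvec.bv_page) + bvec.bv_offset;
                unsigned int len = bvec.bv_len; /* may be > PAGE_SIZE */

                /* program one DMA segment of 'len' bytes at 'phys' */
                pr_debug("segment: %pa len %u\n", &phys, len);
        }

        /* Drivers that really do want page-at-a-time pieces can keep
         * using bio_for_each_segment(), which splits multipage bvecs
         * back up.
         */
}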

Drivers still need to cope with sub-folio-sized reads and writes.
O_DIRECT still exists and (eg) doing a sub-page, block-aligned write
will not necessarily cause readaround to happen.  Filesystems may read
and write their own metadata at whatever granularity and alignment they
see fit.  But the vast majority of pagecache I/O will be folio-sized
and folio-aligned.

I do have two small patches which make it easier for the one
filesystem that I've converted so far (iomap/xfs) to add folios to bios
and get folios back out of bios:

https://lore.kernel.org/lkml/20210505150628.111735-72-willy@infradead.org/
https://lore.kernel.org/lkml/20210505150628.111735-73-willy@infradead.org/

as well as a third patch that estimates how large a bio to allocate,
given the current folio that it's working on:
https://git.infradead.org/users/willy/pagecache.git/commitdiff/89541b126a59dc7319ad618767e2d880fcadd6c2
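
In case it helps to see the shape of those three without clicking
through, here's a rough, untested sketch of how they fit together in a
filesystem.  It's written from memory, the sketch_* names are invented,
and the arithmetic in the allocation helper is only an estimate, so
treat the patches and the commit above as authoritative.

#include <linux/bio.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Sizing: if readahead hands us folios at least as large as this one,
 * the remaining length divided by this folio's size bounds the number
 * of bvecs needed, since each folio occupies a single bvec.
 */
static struct bio *sketch_alloc_read_bio(struct folio *folio, size_t length,
                                         gfp_t gfp)
{
        unsigned int nr_vecs = DIV_ROUND_UP(length, folio_size(folio));

        /* bio_max_segs() clamps the estimate to what one bio can hold */
        return bio_alloc(gfp, bio_max_segs(nr_vecs));
}

/* Submission: add a whole folio to the bio in one go.  Like
 * bio_add_page(), but the length may span many pages.
 */
static void sketch_add_folio(struct bio *bio, struct folio *folio)
{
        if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
                /* bio is full; real code submits it and starts a new one */
        }
}

/* Completion: walk the bio folio by folio instead of page by page */
static void sketch_read_end_io(struct bio *bio)
{
        struct folio_iter fi;

        bio_for_each_folio_all(fi, bio) {
                if (!bio->bi_status && fi.length == folio_size(fi.folio))
                        folio_mark_uptodate(fi.folio);
                folio_unlock(fi.folio);
        }
        bio_put(bio);
}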

It would be possible to make other changes in future.  For example, if
we decide it'd be better, we could change bvecs from being (page, offset,
length) to (folio, offset, length).  I don't know that it's worth doing;
it would need to be evaluated on its merits.  Personally, I'd rather
see us move to a (phys_addr, length) pair, but I'm a little busy at the
moment.
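
Just to illustrate that last idea (purely hypothetical; nothing like
this exists, and struct bio_vec is still (page, offset, length) today):

#include <linux/types.h>

/* What a physical-address-based segment descriptor could look like */
struct sketch_phys_vec {
        phys_addr_t     addr;   /* physical address of the segment */
        u32             len;    /* length in bytes */
};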

Hannes has some fun ideas about using the folio work to support larger
sector sizes, and I think they're doable.



* Re: [LSF/MM/BPF TOPIC] Memory folios
From: Christoph Hellwig @ 2021-05-17 10:00 UTC
  To: Matthew Wilcox
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
> it would need to be evaluated on its merits.  Personally, I'd rather
> see us move to a (phys_addr, length) pair, but I'm a little busy at the
> moment.

This is on my todo list.  Fairly high, but after another block-layer
heavy-lifting project.



* Re: [LSF/MM/BPF TOPIC] Memory folios
From: Keith Busch @ 2021-05-26 21:07 UTC
  To: Matthew Wilcox
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme

On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
> Hannes has some fun ideas about using the folio work to support larger
> sector sizes, and I think they're doable.

I'm also interested in this, and was looking into the exact same thing
recently. Some of the very high-capacity SSDs could really benefit
from better large-sector support. If this is a topic for the conference,
I would like to attend this session.



* Re: [LSF/MM/BPF TOPIC] Memory folios
From: Hannes Reinecke @ 2021-05-27  7:41 UTC
  To: Keith Busch, Matthew Wilcox
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-block, linux-ide,
	linux-scsi, linux-nvme

On 5/26/21 11:07 PM, Keith Busch wrote:
> On Fri, May 14, 2021 at 06:48:26PM +0100, Matthew Wilcox wrote:
>> Hannes has some fun ideas about using the folio work to support larger
>> sector sizes, and I think they're doable.
> 
> I'm also interested in this, and was looking into the exact same thing
> recently. Some of the very high-capacity SSDs could really benefit
> from better large-sector support. If this is a topic for the conference,
> I would like to attend this session.
> 
And, of course, so would I :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)


