* folio_map
@ 2022-08-16 18:08 Matthew Wilcox
  2022-08-16 21:23 ` folio_map John Hubbard
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Matthew Wilcox @ 2022-08-16 18:08 UTC (permalink / raw)
  To: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

Some of you will already know all this, but I'll go into a certain amount
of detail for the peanut gallery.

One of the problems that people want to solve with multi-page folios
is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
already exist; you can happily create a 64kB block size filesystem on
a PPC/ARM/... today, then fail to mount it on an x86 machine.

kmap_local_folio() only lets you map a single page from a folio.
This works for the majority of cases (eg ->write_begin() works on a
per-page basis *anyway*, so we can just map a single page from the folio).
But this is somewhat hampering for ext2_get_page(), used for directory
handling.  A directory record may cross a page boundary (because it
wasn't a page boundary on the machine which created the filesystem),
and juggling two pages being mapped at once is tricky with the stack
model for kmap_local.
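
To make that concrete, the two-page case looks roughly like this (a
sketch only; buf, rec_offset and rec_len are made-up names, and error
handling is omitted):

        size_t part = PAGE_SIZE - rec_offset;
        char *p0, *p1;

        /* A record straddling the first page boundary of the folio. */
        p0 = kmap_local_folio(folio, 0);
        p1 = kmap_local_folio(folio, PAGE_SIZE);

        /* The two mappings are not virtually contiguous, so copy the
         * record out piecewise before using it. */
        memcpy(buf, p0 + rec_offset, part);
        memcpy(buf + part, p1, rec_len - part);

        /* The stack model requires unmapping in reverse order. */
        kunmap_local(p1);
        kunmap_local(p0);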

I don't particularly want to invest heavily in optimising for HIGHMEM.
The number of machines which will use multi-page folios and HIGHMEM is
not going to be large, one hopes, as 64-bit kernels are far more common.
I'm happy for 32-bit to be slow, as long as it works.

For these reasons, I'm proposing the logical equivalent to this:

+void *folio_map_local(struct folio *folio)
+{
+       if (!IS_ENABLED(CONFIG_HIGHMEM))
+               return folio_address(folio);
+       if (!folio_test_large(folio))
+               return kmap_local_page(&folio->page);
+       return vmap_folio(folio);
+}
+
+void folio_unmap_local(const void *addr)
+{
+       if (!IS_ENABLED(CONFIG_HIGHMEM))
+               return;
+       if (is_vmalloc_addr(addr))
+               vunmap(addr);
+       else
+               kunmap_local(addr);
+}

(where vmap_folio() is a new function that works a lot like vmap(),
chunks of this get moved out-of-line, etc, etc., but this concept)
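
A caller (say, something like ext2_get_page()) would then look roughly
like this; the error handling is illustrative only:

        void *addr = folio_map_local(folio);

        if (!addr)
                return -ENOMEM;         /* vmap_folio() can fail under HIGHMEM */
        /* ... walk directory records anywhere in the folio ... */
        folio_unmap_local(addr);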

Does anyone have any better ideas?  If it'd be easy to map N pages
locally, for example ... looks like we only support up to 16 pages
mapped per CPU at any time, so mapping all of a 64kB folio would
almost always fail, and even mapping a 32kB folio would be unlikely
to succeed.
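
(For reference, the per-CPU limit I'm referring to is, if I remember
correctly, KM_MAX_IDX in include/asm-generic/kmap_size.h, which only
grows when CONFIG_DEBUG_KMAP_LOCAL inserts guard pages:

        #define KM_MAX_IDX      16

so a single 64kB folio of 4kB pages would consume the entire stack.)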



* Re: folio_map
  2022-08-16 18:08 folio_map Matthew Wilcox
@ 2022-08-16 21:23 ` John Hubbard
  2022-08-17 10:29 ` folio_map Kirill A. Shutemov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: John Hubbard @ 2022-08-16 21:23 UTC (permalink / raw)
  To: Matthew Wilcox, Thomas Gleixner, Ira Weiny,
	Fabio M. De Francesco, Luis Chamberlain, linux-fsdevel, linux-mm,
	Mark Hairgrove

On 8/16/22 11:08, Matthew Wilcox wrote:
> Some of you will already know all this, but I'll go into a certain amount
> of detail for the peanut gallery.
> 
> One of the problems that people want to solve with multi-page folios
> is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> already exist; you can happily create a 64kB block size filesystem on
> a PPC/ARM/... today, then fail to mount it on an x86 machine.
> 
> kmap_local_folio() only lets you map a single page from a folio.
> This works for the majority of cases (eg ->write_begin() works on a
> per-page basis *anyway*, so we can just map a single page from the folio).
> But this is somewhat hampering for ext2_get_page(), used for directory
> handling.  A directory record may cross a page boundary (because it
> wasn't a page boundary on the machine which created the filesystem),
> and juggling two pages being mapped at once is tricky with the stack
> model for kmap_local.
> 
> I don't particularly want to invest heavily in optimising for HIGHMEM.
> The number of machines which will use multi-page folios and HIGHMEM is
> not going to be large, one hopes, as 64-bit kernels are far more common.
> I'm happy for 32-bit to be slow, as long as it works.

Some of our kernel driver teams recently expressed precisely the same set
of requirements. And at first, I pointed them to kmap_local_folio(),
and then they schooled me by noting that, today, it only does a single
page. :)

> 
> For these reasons, I'm proposing the logical equivalent to this:
> 
> +void *folio_map_local(struct folio *folio)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return folio_address(folio);
> +       if (!folio_test_large(folio))
> +               return kmap_local_page(&folio->page);
> +       return vmap_folio(folio);
> +}

...which led to a desire for code very much like the above: kmap(),
with a fallback to vmap(). Always better to have such things in the
kernel, rather than a zillion copies in drivers.

Adding Mark Hairgrove in case I've missed any fine points?

> +
> +void folio_unmap_local(const void *addr)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return;
> +       if (is_vmalloc_addr(addr))
> +               vunmap(addr);
> +	else
> +       	kunmap_local(addr);
> +}
> 
> (where vmap_folio() is a new function that works a lot like vmap(),
> chunks of this get moved out-of-line, etc, etc., but this concept)
> 
> Does anyone have any better ideas?  If it'd be easy to map N pages
> locally, for example ... looks like we only support up to 16 pages
> mapped per CPU at any time, so mapping all of a 64kB folio would
> almost always fail, and even mapping a 32kB folio would be unlikely
> to succeed.
> 
> 

thanks,
-- 
John Hubbard
NVIDIA


* Re: folio_map
  2022-08-16 18:08 folio_map Matthew Wilcox
  2022-08-16 21:23 ` folio_map John Hubbard
@ 2022-08-17 10:29 ` Kirill A. Shutemov
  2022-08-17 19:38   ` folio_map Matthew Wilcox
  2022-08-18  0:25 ` folio_map Dave Chinner
  2022-08-18 21:10 ` [RFC] vmap_folio() Matthew Wilcox
  3 siblings, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2022-08-17 10:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> Some of you will already know all this, but I'll go into a certain amount
> of detail for the peanut gallery.
> 
> One of the problems that people want to solve with multi-page folios
> is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> already exist; you can happily create a 64kB block size filesystem on
> a PPC/ARM/... today, then fail to mount it on an x86 machine.
> 
> kmap_local_folio() only lets you map a single page from a folio.
> This works for the majority of cases (eg ->write_begin() works on a
> per-page basis *anyway*, so we can just map a single page from the folio).
> But this is somewhat hampering for ext2_get_page(), used for directory
> handling.  A directory record may cross a page boundary (because it
> wasn't a page boundary on the machine which created the filesystem),
> and juggling two pages being mapped at once is tricky with the stack
> model for kmap_local.
> 
> I don't particularly want to invest heavily in optimising for HIGHMEM.
> The number of machines which will use multi-page folios and HIGHMEM is
> not going to be large, one hopes, as 64-bit kernels are far more common.
> I'm happy for 32-bit to be slow, as long as it works.
> 
> For these reasons, I'm proposing the logical equivalent to this:
> 
> +void *folio_map_local(struct folio *folio)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return folio_address(folio);
> +       if (!folio_test_large(folio))
> +               return kmap_local_page(&folio->page);
> +       return vmap_folio(folio);
> +}
> +
> +void folio_unmap_local(const void *addr)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return;
> +       if (is_vmalloc_addr(addr))
> +               vunmap(addr);
> +	else
> +       	kunmap_local(addr);
> +}
> 
> (where vmap_folio() is a new function that works a lot like vmap(),
> chunks of this get moved out-of-line, etc, etc., but this concept)

So it aims at replacing kmap_local_page(), but for folios, right?
kmap_local_page() interface can be used from any context, but vmap helpers
might_sleep(). How do we rectify this?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: folio_map
  2022-08-17 10:29 ` folio_map Kirill A. Shutemov
@ 2022-08-17 19:38   ` Matthew Wilcox
  2022-08-17 20:23     ` folio_map Ira Weiny
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2022-08-17 19:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Wed, Aug 17, 2022 at 01:29:35PM +0300, Kirill A. Shutemov wrote:
> On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > Some of you will already know all this, but I'll go into a certain amount
> > of detail for the peanut gallery.
> > 
> > One of the problems that people want to solve with multi-page folios
> > is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> > already exist; you can happily create a 64kB block size filesystem on
> > a PPC/ARM/... today, then fail to mount it on an x86 machine.
> > 
> > kmap_local_folio() only lets you map a single page from a folio.
> > This works for the majority of cases (eg ->write_begin() works on a
> > per-page basis *anyway*, so we can just map a single page from the folio).
> > But this is somewhat hampering for ext2_get_page(), used for directory
> > handling.  A directory record may cross a page boundary (because it
> > wasn't a page boundary on the machine which created the filesystem),
> > and juggling two pages being mapped at once is tricky with the stack
> > model for kmap_local.
> > 
> > I don't particularly want to invest heavily in optimising for HIGHMEM.
> > The number of machines which will use multi-page folios and HIGHMEM is
> > not going to be large, one hopes, as 64-bit kernels are far more common.
> > I'm happy for 32-bit to be slow, as long as it works.
> > 
> > For these reasons, I'm proposing the logical equivalent to this:
> > 
> > +void *folio_map_local(struct folio *folio)
> > +{
> > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > +               return folio_address(folio);
> > +       if (!folio_test_large(folio))
> > +               return kmap_local_page(&folio->page);
> > +       return vmap_folio(folio);
> > +}
> > +
> > +void folio_unmap_local(const void *addr)
> > +{
> > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > +               return;
> > +       if (is_vmalloc_addr(addr))
> > +               vunmap(addr);
> > +	else
> > +       	kunmap_local(addr);
> > +}
> > 
> > (where vmap_folio() is a new function that works a lot like vmap(),
> > chunks of this get moved out-of-line, etc, etc., but this concept)
> 
> So it aims at replacing kmap_local_page(), but for folios, right?
> kmap_local_page() interface can be used from any context, but vmap helpers
> might_sleep(). How do we rectify this?

I'm not proposing getting rid of kmap_local_folio().  That should still
exist and work for users who need to use it in atomic context.  Indeed,
I'm intending to put a note in the doc for folio_map_local() suggesting
that users may prefer to use kmap_local_folio().  Good idea to put a
might_sleep() in folio_map_local() though.
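
Something like this, just to sketch where the check would go:

void *folio_map_local(struct folio *folio)
{
        might_sleep();
        if (!IS_ENABLED(CONFIG_HIGHMEM))
                return folio_address(folio);
        ...
}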


* Re: folio_map
  2022-08-17 19:38   ` folio_map Matthew Wilcox
@ 2022-08-17 20:23     ` Ira Weiny
  2022-08-17 20:52       ` folio_map Ira Weiny
  0 siblings, 1 reply; 13+ messages in thread
From: Ira Weiny @ 2022-08-17 20:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Thomas Gleixner, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Wed, Aug 17, 2022 at 08:38:52PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 17, 2022 at 01:29:35PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > > Some of you will already know all this, but I'll go into a certain amount
> > > of detail for the peanut gallery.
> > > 
> > > One of the problems that people want to solve with multi-page folios
> > > is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> > > already exist; you can happily create a 64kB block size filesystem on
> > > a PPC/ARM/... today, then fail to mount it on an x86 machine.
> > > 
> > > kmap_local_folio() only lets you map a single page from a folio.
> > > This works for the majority of cases (eg ->write_begin() works on a
> > > per-page basis *anyway*, so we can just map a single page from the folio).
> > > But this is somewhat hampering for ext2_get_page(), used for directory
> > > handling.  A directory record may cross a page boundary (because it
> > > wasn't a page boundary on the machine which created the filesystem),
> > > and juggling two pages being mapped at once is tricky with the stack
> > > model for kmap_local.
> > > 
> > > I don't particularly want to invest heavily in optimising for HIGHMEM.
> > > The number of machines which will use multi-page folios and HIGHMEM is
> > > not going to be large, one hopes, as 64-bit kernels are far more common.
> > > I'm happy for 32-bit to be slow, as long as it works.
> > > 
> > > For these reasons, I'm proposing the logical equivalent to this:
> > > 
> > > +void *folio_map_local(struct folio *folio)
> > > +{
> > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > +               return folio_address(folio);
> > > +       if (!folio_test_large(folio))
> > > +               return kmap_local_page(&folio->page);
> > > +       return vmap_folio(folio);
> > > +}
> > > +
> > > +void folio_unmap_local(const void *addr)
> > > +{
> > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > +               return;
> > > +       if (is_vmalloc_addr(addr))
> > > +               vunmap(addr);
> > > +	else
> > > +       	kunmap_local(addr);
> > > +}
> > > 
> > > (where vmap_folio() is a new function that works a lot like vmap(),
> > > chunks of this get moved out-of-line, etc, etc., but this concept)
> > 
> > So it aims at replacing kmap_local_page(), but for folios, right?
> > kmap_local_page() interface can be used from any context, but vmap helpers
> > might_sleep(). How do we rectify this?
> 
> I'm not proposing getting rid of kmap_local_folio().  That should still
> exist and work for users who need to use it in atomic context.  Indeed,
> I'm intending to put a note in the doc for folio_map_local() suggesting
> that users may prefer to use kmap_local_folio().  Good idea to put a
> might_sleep() in folio_map_local() though.

There is also a semantic mismatch WRT the unmapping order.  But I think
Kirill brings up a bigger issue.

How many folios do you think will need to be mapped at a time?  And is there
any practical limit on their size?  Are 64k blocks a reasonable upper bound
until highmem can be deprecated completely?

I say this because I'm not sure that mapping a 64k block would always fail.
These mappings are transitory.  How often will a filesystem be mapping more
than 2 folios at once?

In our conversions most of the time 2 pages are mapped at once,
source/destination.

That said, to help ensure that a full folio map never fails we could increase
the number of pages supported by kmap_local_page().  At first, I was not a fan
but that would only be a penalty for HIGHMEM systems.  And as we are not
optimizing for such systems I'm not sure I see a downside to increasing the
limit to 32 or even 64.  I'm also inclined to believe that HIGHMEM systems have
smaller core counts.  So I don't think this is likely to multiply the space
wasted by much.

Would doubling the support within kmap_local_page() be enough?

A final idea would be to hide the increase behind a 'support large block size
filesystems' config option under HIGHMEM systems.  But I'm really not sure that
is even needed.

Ira


* Re: folio_map
  2022-08-17 20:23     ` folio_map Ira Weiny
@ 2022-08-17 20:52       ` Ira Weiny
  2022-08-17 21:34         ` folio_map Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Ira Weiny @ 2022-08-17 20:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Thomas Gleixner, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Wed, Aug 17, 2022 at 01:23:41PM -0700, Ira wrote:
> On Wed, Aug 17, 2022 at 08:38:52PM +0100, Matthew Wilcox wrote:
> > On Wed, Aug 17, 2022 at 01:29:35PM +0300, Kirill A. Shutemov wrote:
> > > On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > > > Some of you will already know all this, but I'll go into a certain amount
> > > > of detail for the peanut gallery.
> > > > 
> > > > One of the problems that people want to solve with multi-page folios
> > > > is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> > > > already exist; you can happily create a 64kB block size filesystem on
> > > > a PPC/ARM/... today, then fail to mount it on an x86 machine.
> > > > 
> > > > kmap_local_folio() only lets you map a single page from a folio.
> > > > This works for the majority of cases (eg ->write_begin() works on a
> > > > per-page basis *anyway*, so we can just map a single page from the folio).
> > > > But this is somewhat hampering for ext2_get_page(), used for directory
> > > > handling.  A directory record may cross a page boundary (because it
> > > > wasn't a page boundary on the machine which created the filesystem),
> > > > and juggling two pages being mapped at once is tricky with the stack
> > > > model for kmap_local.
> > > > 
> > > > I don't particularly want to invest heavily in optimising for HIGHMEM.
> > > > The number of machines which will use multi-page folios and HIGHMEM is
> > > > not going to be large, one hopes, as 64-bit kernels are far more common.
> > > > I'm happy for 32-bit to be slow, as long as it works.
> > > > 
> > > > For these reasons, I'm proposing the logical equivalent to this:
> > > > 
> > > > +void *folio_map_local(struct folio *folio)
> > > > +{
> > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > +               return folio_address(folio);
> > > > +       if (!folio_test_large(folio))
> > > > +               return kmap_local_page(&folio->page);
> > > > +       return vmap_folio(folio);
> > > > +}
> > > > +
> > > > +void folio_unmap_local(const void *addr)
> > > > +{
> > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > +               return;
> > > > +       if (is_vmalloc_addr(addr))
> > > > +               vunmap(addr);
> > > > +	else
> > > > +       	kunmap_local(addr);
> > > > +}
> > > > 
> > > > (where vmap_folio() is a new function that works a lot like vmap(),
> > > > chunks of this get moved out-of-line, etc, etc., but this concept)
> > > 
> > > So it aims at replacing kmap_local_page(), but for folios, right?
> > > kmap_local_page() interface can be used from any context, but vmap helpers
> > > might_sleep(). How do we rectify this?
> > 
> > I'm not proposing getting rid of kmap_local_folio().  That should still
> > exist and work for users who need to use it in atomic context.  Indeed,
> > I'm intending to put a note in the doc for folio_map_local() suggesting
> > that users may prefer to use kmap_local_folio().  Good idea to put a
> > might_sleep() in folio_map_local() though.
> 
> There is also a semantic mismatch WRT the unmapping order.  But I think
> Kirill brings up a bigger issue.
> 
> How many folios do you think will need to be mapped at a time?  And is there
> any practical limit on their size?  Are 64k blocks a reasonable upper bound
> until highmem can be deprecated completely?
> 
> I say this because I'm not sure that mapping a 64k block would always fail.
> These mappings are transitory.  How often will a filesystem be mapping more
> than 2 folios at once?

I did the math wrong but I think my idea can still work.

> 
> In our conversions most of the time 2 pages are mapped at once,
> source/destination.
> 
> That said, to help ensure that a full folio map never fails we could increase
> the number of pages supported by kmap_local_page().  At first, I was not a fan
> but that would only be a penalty for HIGHMEM systems.  And as we are not
> optimizing for such systems I'm not sure I see a downside to increasing the
> limit to 32 or even 64.  I'm also inclined to believe that HIGHMEM systems have
> smaller core counts.  So I don't think this is likely to multiply the space
> wasted by much.
> 
> Would doubling the support within kmap_local_page() be enough?
> 
> A final idea would be to hide the increase behind a 'support large block size
> filesystems' config option under HIGHMEM systems.  But I'm really not sure that
> is even needed.
> 
> Ira
> 


* Re: folio_map
  2022-08-17 20:52       ` folio_map Ira Weiny
@ 2022-08-17 21:34         ` Matthew Wilcox
  2022-08-18  1:28           ` folio_map Ira Weiny
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2022-08-17 21:34 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Kirill A. Shutemov, Thomas Gleixner, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Wed, Aug 17, 2022 at 01:52:33PM -0700, Ira Weiny wrote:
> On Wed, Aug 17, 2022 at 01:23:41PM -0700, Ira wrote:
> > On Wed, Aug 17, 2022 at 08:38:52PM +0100, Matthew Wilcox wrote:
> > > On Wed, Aug 17, 2022 at 01:29:35PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > > > > Some of you will already know all this, but I'll go into a certain amount
> > > > > of detail for the peanut gallery.
> > > > > 
> > > > > One of the problems that people want to solve with multi-page folios
> > > > > is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> > > > > already exist; you can happily create a 64kB block size filesystem on
> > > > > a PPC/ARM/... today, then fail to mount it on an x86 machine.
> > > > > 
> > > > > kmap_local_folio() only lets you map a single page from a folio.
> > > > > This works for the majority of cases (eg ->write_begin() works on a
> > > > > per-page basis *anyway*, so we can just map a single page from the folio).
> > > > > But this is somewhat hampering for ext2_get_page(), used for directory
> > > > > handling.  A directory record may cross a page boundary (because it
> > > > > wasn't a page boundary on the machine which created the filesystem),
> > > > > and juggling two pages being mapped at once is tricky with the stack
> > > > > model for kmap_local.
> > > > > 
> > > > > I don't particularly want to invest heavily in optimising for HIGHMEM.
> > > > > The number of machines which will use multi-page folios and HIGHMEM is
> > > > > not going to be large, one hopes, as 64-bit kernels are far more common.
> > > > > I'm happy for 32-bit to be slow, as long as it works.
> > > > > 
> > > > > For these reasons, I'm proposing the logical equivalent to this:
> > > > > 
> > > > > +void *folio_map_local(struct folio *folio)
> > > > > +{
> > > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > > +               return folio_address(folio);
> > > > > +       if (!folio_test_large(folio))
> > > > > +               return kmap_local_page(&folio->page);
> > > > > +       return vmap_folio(folio);
> > > > > +}
> > > > > +
> > > > > +void folio_unmap_local(const void *addr)
> > > > > +{
> > > > > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > > > > +               return;
> > > > > +       if (is_vmalloc_addr(addr))
> > > > > +               vunmap(addr);
> > > > > +	else
> > > > > +       	kunmap_local(addr);
> > > > > +}
> > > > > 
> > > > > (where vmap_folio() is a new function that works a lot like vmap(),
> > > > > chunks of this get moved out-of-line, etc, etc., but this concept)
> > > > 
> > > > So it aims at replacing kmap_local_page(), but for folios, right?
> > > > kmap_local_page() interface can be used from any context, but vmap helpers
> > > > might_sleep(). How do we rectify this?
> > > 
> > > I'm not proposing getting rid of kmap_local_folio().  That should still
> > > exist and work for users who need to use it in atomic context.  Indeed,
> > > I'm intending to put a note in the doc for folio_map_local() suggesting
> > > that users may prefer to use kmap_local_folio().  Good idea to put a
> > > might_sleep() in folio_map_local() though.
> > 
> > There is also a semantic mismatch WRT the unmapping order.  But I think
> > Kirill brings up a bigger issue.

I don't see the semantic mismatch?

> > How many folios do you think will need to be mapped at a time?  And is there
> > any practical limit on their size?  Are 64k blocks a reasonable upper bound
> > until highmem can be deprecated completely?
> > 
> > I say this because I'm not sure that mapping a 64k block would always fail.
> > These mappings are transitory.  How often will a filesystem be mapping more
> > than 2 folios at once?
> 
> I did the math wrong but I think my idea can still work.

The thing is that kmap_local_page() can be called from interrupt context
(how often is it?  no idea).  So you map two 64kB folios (at 16 entries
each) and that consumes 32 entries for this CPU, now you take an interrupt
and that's 33.  I don't know how deep that goes; can we have some mapped
in userspace, some mapped in softirq and then another interrupt causes
more to be mapped in hardirq?  I don't really want to find out, so I'd
rather always punt to vmap() for multipage folios.

Is there a reason you want to make folio_map_local() more efficient
on HIGHMEM systems?

> > 
> > In our conversions most of the time 2 pages are mapped at once,
> > source/destination.
> > 
> > That said, to help ensure that a full folio map never fails we could increase
> > the number of pages supported by kmap_local_page().  At first, I was not a fan
> > but that would only be a penalty for HIGHMEM systems.  And as we are not
> > optimizing for such systems I'm not sure I see a downside to increasing the
> > limit to 32 or even 64.  I'm also inclined to believe that HIGHMEM systems have
> > smaller core counts.  So I don't think this is likely to multiply the space
> > wasted by much.
> > 
> > Would doubling the support within kmap_local_page() be enough?
> > 
> > A final idea would be to hide the increase behind a 'support large block size
> > filesystems' config option under HIGHMEM systems.  But I'm really not sure that
> > is even needed.
> > 
> > Ira
> > 


* Re: folio_map
  2022-08-16 18:08 folio_map Matthew Wilcox
  2022-08-16 21:23 ` folio_map John Hubbard
  2022-08-17 10:29 ` folio_map Kirill A. Shutemov
@ 2022-08-18  0:25 ` Dave Chinner
  2022-08-18 21:10 ` [RFC] vmap_folio() Matthew Wilcox
  3 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2022-08-18  0:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> Some of you will already know all this, but I'll go into a certain amount
> of detail for the peanut gallery.
> 
> One of the problems that people want to solve with multi-page folios
> is supporting filesystem block sizes > PAGE_SIZE.  Such filesystems
> already exist; you can happily create a 64kB block size filesystem on
> a PPC/ARM/... today, then fail to mount it on an x86 machine.

The XFS buffer cache already supports 64kB block sizes on 4kB page
size machines - we do this with bulk page allocation and
vm_map_ram()/vm_unmap_ram() of the page arrays that are built.

These mappings are persistent (i.e. cannot be local), but if you
want to prototype something before the page cache has been
completely modified to support BS > PS, then the XFS buffer
cache already does what you need. Just make XFS filesystems with
"-n size=64k" to use directory block sizes of 64kB and do lots of
work with directory operations on large directories.
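
For reference, the mapping side of that is just vm_map_ram() on the
buffer's page array - roughly this, simplified from
_xfs_buf_map_pages() in fs/xfs/xfs_buf.c:

        /* map the discontiguous pages backing the buffer */
        bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1);
        ...
        /* and later, when the buffer is torn down */
        vm_unmap_ram(bp->b_addr, bp->b_page_count);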

> kmap_local_folio() only lets you map a single page from a folio.
> This works for the majority of cases (eg ->write_begin() works on a
> per-page basis *anyway*, so we can just map a single page from the folio).
> But this is somewhat hampering for ext2_get_page(), used for directory
> handling.  A directory record may cross a page boundary (because it
> wasn't a page boundary on the machine which created the filesystem),
> and juggling two pages being mapped at once is tricky with the stack
> model for kmap_local.

Yup, that's exactly the problem we avoid by using mapped buffers in
XFS.

> I don't particularly want to invest heavily in optimising for HIGHMEM.
> The number of machines which will use multi-page folios and HIGHMEM is
> not going to be large, one hopes, as 64-bit kernels are far more common.
> I'm happy for 32-bit to be slow, as long as it works.

Fully agree.

> For these reasons, I'm proposing the logical equivalent to this:
> 
> +void *folio_map_local(struct folio *folio)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return folio_address(folio);
> +       if (!folio_test_large(folio))
> +               return kmap_local_page(&folio->page);
> +       return vmap_folio(folio);
> +}
> +
> +void folio_unmap_local(const void *addr)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return;
> +       if (is_vmalloc_addr(addr))
> +               vunmap(addr);
> +	else
> +       	kunmap_local(addr);
> +}
> 
> (where vmap_folio() is a new function that works a lot like vmap(),
> chunks of this get moved out-of-line, etc, etc., but this concept)

*nod*

> Does anyone have any better ideas?  If it'd be easy to map N pages
> locally, for example ... looks like we only support up to 16 pages
> mapped per CPU at any time, so mapping all of a 64kB folio would
> almost always fail, and even mapping a 32kB folio would be unlikely
> to succeed.

FWIW, what I really want for the XFS buffer cache is a large folio
aware variant of vm_map_ram/vm_unmap_ram(). i.e. something we can
pass a random assortment of folios into, and it just does the right
thing to create a persistent contiguous mapping of the folios.

i.e. we have an allocation loop that tries to allocate large folios,
but then falls back to smaller folios if the large allocation cannot
be fulfilled without blocking. Then the mapping function works with
whatever we managed to allocate in the most optimal way....
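
In other words, an interface with roughly this shape (hypothetical;
the names are made up):

        /* Map an array of variously-sized folios contiguously. */
        void *vm_map_folios(struct folio **folios, unsigned int nr_folios);
        void vm_unmap_folios(const void *addr);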

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: folio_map
  2022-08-17 21:34         ` folio_map Matthew Wilcox
@ 2022-08-18  1:28           ` Ira Weiny
  0 siblings, 0 replies; 13+ messages in thread
From: Ira Weiny @ 2022-08-18  1:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kirill A. Shutemov, Thomas Gleixner, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Wed, Aug 17, 2022 at 10:34:05PM +0100, Matthew Wilcox wrote:
> On Wed, Aug 17, 2022 at 01:52:33PM -0700, Ira Weiny wrote:
> > On Wed, Aug 17, 2022 at 01:23:41PM -0700, Ira wrote:
 
[snip]

> > > How many folios do you think will need to be mapped at a time?  And is there
> > > any practical limit on their size?  Are 64k blocks a reasonable upper bound
> > > until highmem can be deprecated completely?
> > > 
> > > I say this because I'm not sure that mapping a 64k block would always fail.
> > > These mappings are transitory.  How often will a filesystem be mapping more
> > > than 2 folios at once?
> > 
> > I did the math wrong but I think my idea can still work.
> 
> The thing is that kmap_local_page() can be called from interrupt context
> (how often is it?  no idea).  So you map two 64kB folios (at 16 entries
> each) and that consumes 32 entries for this CPU, now you take an interrupt
> and that's 33.  I don't know how deep that goes; can we have some mapped
> in userspace, some mapped in softirq and then another interrupt causes
> more to be mapped in hardirq?  I don't really want to find out, so I'd
> rather always punt to vmap() for multipage folios.
> 
> Is there a reason you want to make folio_map_local() more efficient
> on HIGHMEM systems?

It is not about efficiency.  It is about making the callers of
folio_map_local() not have to worry about the context it is used in.

If 64 entries is not enough, how many?  I'm trying to see if there is any bound
we could reasonably establish.  Even kmap_local_page() _may_ run out of
entries.

Ira


* [RFC] vmap_folio()
  2022-08-16 18:08 folio_map Matthew Wilcox
                   ` (2 preceding siblings ...)
  2022-08-18  0:25 ` folio_map Dave Chinner
@ 2022-08-18 21:10 ` Matthew Wilcox
  2022-08-19 10:53   ` Uladzislau Rezki
  3 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2022-08-18 21:10 UTC (permalink / raw)
  To: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm, Uladzislau Rezki

On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> For these reasons, I'm proposing the logical equivalent to this:
> 
> +void *folio_map_local(struct folio *folio)
> +{
> +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> +               return folio_address(folio);
> +       if (!folio_test_large(folio))
> +               return kmap_local_page(&folio->page);
> +       return vmap_folio(folio);
> +}
> 
> (where vmap_folio() is a new function that works a lot like vmap(),
> chunks of this get moved out-of-line, etc, etc., but this concept)

This vmap_folio() compiles but is otherwise untested.  Anything I
obviously got wrong here?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dd6cdb201195..1867759c33ff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2848,6 +2848,42 @@ void *vmap(struct page **pages, unsigned int count,
 }
 EXPORT_SYMBOL(vmap);
 
+#ifdef CONFIG_HIGHMEM
+/**
+ * vmap_folio - Map an entire folio into virtually contiguous space
+ * @folio: The folio to map.
+ *
+ * Maps all pages in @folio into contiguous kernel virtual space.  This
+ * function is only available in HIGHMEM builds; for !HIGHMEM, use
+ * folio_address().  The pages are mapped with PAGE_KERNEL permissions.
+ *
+ * Return: The address of the area or %NULL on failure
+ */
+void *vmap_folio(struct folio *folio)
+{
+	size_t size = folio_size(folio);
+	struct vm_struct *area;
+	unsigned long addr;
+
+	might_sleep();
+
+	area = get_vm_area_caller(size, VM_MAP, __builtin_return_address(0));
+	if (!area)
+		return NULL;
+
+	addr = (unsigned long)area->addr;
+	if (vmap_range_noflush(addr, addr + size,
+				folio_pfn(folio) << PAGE_SHIFT,
+				PAGE_KERNEL, folio_shift(folio))) {
+		vunmap(area->addr);
+		return NULL;
+	}
+	flush_cache_vmap(addr, addr + size);
+
+	return area->addr;
+}
+#endif
+
 #ifdef CONFIG_VMAP_PFN
 struct vmap_pfn_data {
 	unsigned long	*pfns;


* Re: [RFC] vmap_folio()
  2022-08-18 21:10 ` [RFC] vmap_folio() Matthew Wilcox
@ 2022-08-19 10:53   ` Uladzislau Rezki
  2022-08-19 15:45     ` Matthew Wilcox
  0 siblings, 1 reply; 13+ messages in thread
From: Uladzislau Rezki @ 2022-08-19 10:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm, Uladzislau Rezki

On Thu, Aug 18, 2022 at 10:10:14PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 16, 2022 at 07:08:22PM +0100, Matthew Wilcox wrote:
> > For these reasons, I'm proposing the logical equivalent to this:
> > 
> > +void *folio_map_local(struct folio *folio)
> > +{
> > +       if (!IS_ENABLED(CONFIG_HIGHMEM))
> > +               return folio_address(folio);
> > +       if (!folio_test_large(folio))
> > +               return kmap_local_page(&folio->page);
> > +       return vmap_folio(folio);
> > +}
> > 
> > (where vmap_folio() is a new function that works a lot like vmap(),
> > chunks of this get moved out-of-line, etc, etc., but this concept)
> 
> This vmap_folio() compiles but is otherwise untested.  Anything I
> obviously got wrong here?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index dd6cdb201195..1867759c33ff 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2848,6 +2848,42 @@ void *vmap(struct page **pages, unsigned int count,
>  }
>  EXPORT_SYMBOL(vmap);
>  
> +#ifdef CONFIG_HIGHMEM
> +/**
> + * vmap_folio - Map an entire folio into virtually contiguous space
> + * @folio: The folio to map.
> + *
> + * Maps all pages in @folio into contiguous kernel virtual space.  This
> + * function is only available in HIGHMEM builds; for !HIGHMEM, use
> + * folio_address().  The pages are mapped with PAGE_KERNEL permissions.
> + *
> + * Return: The address of the area or %NULL on failure
> + */
> +void *vmap_folio(struct folio *folio)
> +{
> +	size_t size = folio_size(folio);
> +	struct vm_struct *area;
> +	unsigned long addr;
> +
> +	might_sleep();
> +
> +	area = get_vm_area_caller(size, VM_MAP, __builtin_return_address(0));
> +	if (!area)
> +		return NULL;
> +
> +	addr = (unsigned long)area->addr;
> +	if (vmap_range_noflush(addr, addr + size,
> +				folio_pfn(folio) << PAGE_SHIFT,
> +				PAGE_KERNEL, folio_shift(folio))) {
> +		vunmap(area->addr);
> +		return NULL;
> +	}
> +	flush_cache_vmap(addr, addr + size);
> +
> +	return area->addr;
> +}
> +#endif
> +
Looks pretty straightforward. One thing though: could we combine it with
vmap(), since it is a copy-paste to some extent - say, have something like a
__vmap() helper that both vmap_folio() and vmap() reuse?

But that is just a thought.

--
Uladzislau Rezki


* Re: [RFC] vmap_folio()
  2022-08-19 10:53   ` Uladzislau Rezki
@ 2022-08-19 15:45     ` Matthew Wilcox
  2022-08-22 19:54       ` Uladzislau Rezki
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2022-08-19 15:45 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Thomas Gleixner, Ira Weiny, Fabio M. De Francesco,
	Luis Chamberlain, linux-fsdevel, linux-mm

On Fri, Aug 19, 2022 at 12:53:32PM +0200, Uladzislau Rezki wrote:
> Looks pretty straightforward. One thing though: could we combine it with
> vmap(), since it is a copy-paste to some extent - say, have something like a
> __vmap() helper that both vmap_folio() and vmap() reuse?
> 
> But that is just a thought.

Thanks for looking it over!

Combining it with vmap() or vm_map_ram() is tricky.  Today, we assume
that each struct page pointer refers to exactly PAGE_SIZE bytes, so if
somebody calls alloc_pages(__GFP_COMP, 4) and then passes the head
page to vmap(), only that one page gets mapped.  I don't know whether
any current callers depend on that behaviour.
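
i.e. today a caller with a large folio has to spell the pages out
itself, something like this (sketch, assuming the folio is small
enough for an on-stack array):

        struct page *pages[16];        /* e.g. a 64kB folio of 4kB pages */
        long i;

        for (i = 0; i < folio_nr_pages(folio); i++)
                pages[i] = folio_page(folio, i);
        addr = vmap(pages, folio_nr_pages(folio), VM_MAP, PAGE_KERNEL);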

Now that I look at the future customers of this, I think I erred in basing
this on vmap(), it looks like vm_map_ram() is preferred.  So I'll redo
based on vm_map_ram().



* Re: [RFC] vmap_folio()
  2022-08-19 15:45     ` Matthew Wilcox
@ 2022-08-22 19:54       ` Uladzislau Rezki
  0 siblings, 0 replies; 13+ messages in thread
From: Uladzislau Rezki @ 2022-08-22 19:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Uladzislau Rezki, Thomas Gleixner, Ira Weiny,
	Fabio M. De Francesco, Luis Chamberlain, linux-fsdevel, linux-mm

On Fri, Aug 19, 2022 at 04:45:05PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 19, 2022 at 12:53:32PM +0200, Uladzislau Rezki wrote:
> > Looks pretty straightforward. One thing though: could we combine it with
> > vmap(), since it is a copy-paste to some extent - say, have something like a
> > __vmap() helper that both vmap_folio() and vmap() reuse?
> > 
> > But that is just a thought.
> 
> Thanks for looking it over!
> 
You are welcome.

>
> Combining it with vmap() or vm_map_ram() is tricky.  Today, we assume
> that each struct page pointer refers to exactly PAGE_SIZE bytes, so if
> somebody calls alloc_pages(__GFP_COMP, 4) and then passes the head
> page to vmap(), only that one page gets mapped.  I don't know whether
> any current callers depend on that behaviour.
> 
> Now that I look at the future customers of this, I think I erred in basing
> this on vmap(), it looks like vm_map_ram() is preferred.  So I'll redo
> based on vm_map_ram().
> 
Indeed, the vmap code has no knowledge of compound pages. You can add
me to CC so I can have a look at it and help out if there is a need.

--
Uladzislau Rezki


Thread overview: 13+ messages
2022-08-16 18:08 folio_map Matthew Wilcox
2022-08-16 21:23 ` folio_map John Hubbard
2022-08-17 10:29 ` folio_map Kirill A. Shutemov
2022-08-17 19:38   ` folio_map Matthew Wilcox
2022-08-17 20:23     ` folio_map Ira Weiny
2022-08-17 20:52       ` folio_map Ira Weiny
2022-08-17 21:34         ` folio_map Matthew Wilcox
2022-08-18  1:28           ` folio_map Ira Weiny
2022-08-18  0:25 ` folio_map Dave Chinner
2022-08-18 21:10 ` [RFC] vmap_folio() Matthew Wilcox
2022-08-19 10:53   ` Uladzislau Rezki
2022-08-19 15:45     ` Matthew Wilcox
2022-08-22 19:54       ` Uladzislau Rezki
