* Struct page proposal
@ 2021-09-23  1:21 Kent Overstreet
  2021-09-23  3:23 ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Kent Overstreet @ 2021-09-23  1:21 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, linux-mm
  Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

One thing that's come out of the folios discussions with both Matthew and
Johannes is that we seem to be thinking along similar lines regarding our end
goals for struct page.

The fundamental reason for struct page is that we need memory to be self
describing, without any context - we need to be able to go from a generic
untyped struct page and figure out what it contains: handling physical memory
failure is the most prominent example, but migration and compaction are more
common. We need to be able to ask the thing that owns a page of memory "hey,
stop using this and move your stuff here".

Matthew's helpfully been coming up with a list of page types:
https://kernelnewbies.org/MemoryTypes

But struct page could be a lot smaller than it is now. I think we can get it
down to two pointers, which means it'll take up 0.4% of system memory. Both
Matthew and Johannes have ideas for getting it down even further - the main
thing to note is that virt_to_page() _should_ be an uncommon operation (most of
the places we're currently using it are completely unnecessary, look at all the
places we're using it on the zero page). Johannes is thinking two layer radix
tree, Matthew was thinking about using maple trees - personally, I think that
0.4% of system memory is plenty good enough.
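
(For scale: that's two 8-byte words per 4096-byte page - 16/4096 ≈ 0.4% - versus
roughly 1.6% for today's 64-byte struct page, assuming 64-bit and 4k pages.)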


Ok, but what do we do with the stuff currently in struct page?
-------------------------------------------------------------

The main thing to note is that in normal operation most folios are going to be
describing many pages, not just one - so we'll be using _less_ memory overall
if we allocate them separately. That's cool.

Of course, for this to make sense, we'll have to get all the other stuff in
struct page moved out into its own types, but file & anon pages are the big
ones, and those are already being tackled.

Why two ulongs/pointers, instead of just one?
---------------------------------------------

Because one of the things we really want and don't have now is a clean division
between allocator and allocatee state. "Allocator" means either the buddy
allocator or slab; "allocatee" state would be the folio or the network pool state
or whatever actually called kmalloc() or alloc_pages().

Right now slab state sits in the same place in struct page as allocatee state
does. This is bad because slab/slub are a hell of a lot faster than the buddy
allocator, and Johannes wants to move the boundary between slab allocations and
buddy allocator allocations up to something like 64k. If we fix where slab state
lives, that becomes completely trivial to do.

So if we have this:

struct page {
	unsigned long	allocator;
	unsigned long	allocatee;
};

The allocator field would hold either a pointer to slab/slub's state, if it's a
slab page, or, if it's a buddy allocator page, an encoding of the order of the
allocation - like compound order today - and probably whether or not the
(compound group of) pages is free.

The allocatee field would be a type-tagged pointer (tagged using the low bits of
the pointer - see the sketch below this list) to one of:
 - struct folio
 - struct anon_folio, if that becomes a thing
 - struct network_pool_page
 - struct pte_page
 - struct zone_device_page
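
To make the tagging concrete, here's a rough sketch of what the accessors could
look like - every name below (page_owner_type, PAGE_OWNER_*, page_owner_folio)
is made up purely for illustration, none of it is existing kernel API, and it
assumes the owner structs are at least 8-byte aligned so we get three tag bits:

enum page_owner_type {
	PAGE_OWNER_FOLIO,
	PAGE_OWNER_ANON_FOLIO,
	PAGE_OWNER_NET_POOL,
	PAGE_OWNER_PTE,
	PAGE_OWNER_ZONE_DEVICE,
};

#define PAGE_OWNER_TYPE_MASK	7UL

static inline enum page_owner_type page_owner_type(const struct page *page)
{
	return (enum page_owner_type)(page->allocatee & PAGE_OWNER_TYPE_MASK);
}

static inline struct folio *page_owner_folio(const struct page *page)
{
	if (page_owner_type(page) != PAGE_OWNER_FOLIO)
		return NULL;
	return (struct folio *)(page->allocatee & ~PAGE_OWNER_TYPE_MASK);
}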

Then we can further refactor things until all the stuff that's currently crammed
in struct page lives in types where each struct field means one and precisely
one thing, and also where we can freely reshuffle and reorganize and add stuff
to the various types where we couldn't before because it'd make struct page
bigger.

Other notes & potential issues:
 - page->compound_dtor needs to die

 - page->rcu_head moves into the types that actually need it, no issues there

 - page->refcount has question marks around it. I think we can also just move it
   into the types that need it; with RCU, dereferencing the pointer to the folio
   or whatever and grabbing a ref on folio->refcount can happen under an RCU read
   lock (see the sketch after this list) - there's no real question about whether
   it's technically possible to get it out of struct page, and I think it would
   be cleaner overall that way.

   However, depending on how it's used from code paths that go from generic
   untyped pages, I could see it turning into more of a hassle than it's worth.
   More investigation is needed.

 - page->memcg_data - I don't know whether that one more properly belongs in
   struct page or in the page subtypes - I'd love it if Johannes could talk
   about that one.

 - page->flags - dealing with this is going to be a huge hassle but also where
   we'll find some of the biggest gains in overall sanity and readability of the
   code. Right now, PG_locked is super special and ad hoc and I have run into
   situations multiple times (and Johannes was in vehement agreement on this
   one) where I simply could not figure out the behaviour of the current code re:
   who is responsible for locking pages without instrumenting the code with
   assertions.

   Meaning anything we do to create and enforce module boundaries between
   different chunks of code is going to suck, but the end result should be
   really worthwhile.
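
To sketch the refcount-under-RCU lookup mentioned above (this assumes the owner
pointer lives behind some __rcu slot and that folio->refcount would be a
refcount_t - both are assumptions about the future layout, not current code):

static struct folio *folio_get_rcu(struct folio __rcu **slot)
{
	struct folio *folio;

	rcu_read_lock();
	folio = rcu_dereference(*slot);
	if (folio && !refcount_inc_not_zero(&folio->refcount))
		folio = NULL;	/* raced with the final put; caller retries or bails */
	rcu_read_unlock();

	return folio;
}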

Matthew Wilcox and David Howells have been having conversations on IRC about
what to do about other page bits. It appears we should be able to kill a lot of
filesystem usage of both PG_private and PG_private_2 - filesystems in general
hang state off of page->private, soon to be folio->private, and PG_private in
current use just indicates whether page->private is nonzero - meaning it's
completely redundant.


* Re: Struct page proposal
  2021-09-23  1:21 Struct page proposal Kent Overstreet
@ 2021-09-23  3:23 ` Matthew Wilcox
  2021-09-23  5:15   ` Kent Overstreet
  2021-09-23  9:03 ` Struct page proposal David Hildenbrand
  2021-09-27 17:48 ` Vlastimil Babka
  2 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-23  3:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Wed, Sep 22, 2021 at 09:21:31PM -0400, Kent Overstreet wrote:
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".

Yup, and another thing we need is to take any page mapped to userspace
and mark it as dirty (whatever that means for the owner of the page).
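
(For reference, that's roughly what the current set_page_dirty() path boils
down to - heavily simplified here; the real function also handles compound
heads, swap-backed pages and so on:)

int set_page_dirty(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	/* the owner of the page decides what "dirty" means, via a_ops */
	if (likely(mapping))
		return mapping->a_ops->set_page_dirty(page);

	if (!TestSetPageDirty(page))
		return 1;
	return 0;
}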

> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
> 
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.

As with a lot of these future plans, I think the details can vary
slightly.  What I propose on the above wiki page is to take it
down to one pointer per page, but yes, I have a dream that eventually we
can take it down to one pointer + size per allocation (so 16 bytes)
rather than 16 bytes per page.

> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
> 
> The main thing to note is that since in normal operation most folios are going
> to be describing many pages, not just one - and we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
> 
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.

We can also allocate a far larger structure.  eg, we might decide that
a file page looks like this:

struct folio {
    unsigned long flags;
    unsigned long pfn;
    struct list_head lru;
    struct address_space *mapping;
    pgoff_t index;
    void *private;
    atomic_t _mapcount;
    atomic_t _refcount;
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif
    unsigned char dtor;
    unsigned char order;
    atomic_t hmapcount;
    unsigned int nr_pages;
    atomic_t hpinned_count;
    struct list_head deferred_list;
};

(compiling that list reminds me that we'll need to sort out mapcount
on subpages when it comes time to do this.  ask me if you don't know
what i'm talking about here.)

> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
> 
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
> 
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
> 
> So if we have this:
> 
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> 
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
> 
> The allocatee field would be used for a type tagged (using the low bits of the
> pointer) to one of:
>  - struct folio
>  - struct anon_folio, if that becomes a thing
>  - struct network_pool_page
>  - struct pte_page
>  - struct zone_device_page

I think we /can/ do all this.  I don't know that it's the right thing to
do.  And I really mean that.  I genuinely don't know that "allocate
file pages from slab" will solve any problems at all.  And I kind of
don't want to investigate that until later.

By the way, another way we could do this is to put the 'allocator'
field into the allocatee's data structure.  eg the first word
in struct folio could point to the struct slab that contains it.

> Other notes & potential issues:
>  - page->compound_dtor needs to die

The reason we have it right now is that the last person to call
put_page() may not be the one who allocated it.  _maybe_ we can do
all-of-the-dtor-stuff when the person who allocates it frees it, and
have put_page() only free the memory.  TBD.

>  - page->rcu_head moves into the types that actually need it, no issues there

Hope so!

>  - page->refcount has question marks around it. I think we can also just move it
>    into the types that need it; with RCU derefing the pointer to the folio or
>    whatever and grabing a ref on folio->refcount can happen under a RCU read
>    lock - there's no real question about whether it's technically possible to
>    get it out of struct page, and I think it would be cleaner overall that way.
> 
>    However, depending on how it's used from code paths that go from generic
>    untyped pages, I could see it turning into more of a hassle than it's worth.
>    More investigation is needed.

I think this depends how far we go splitting everything apart.

>  - page->memcg_data - I don't know whether that one more properly belongs in
>    struct page or in the page subtypes - I'd love it if Johannes could talk
>    about that one.

Johannes certainly knows more about this than I do.  I think it's needed
for anon folios, file folios and slab, but maybe it's needed for page
tables too?

>  - page->flags - dealing with this is going to be a huge hassle but also where
>    we'll find some of the biggest gains in overall sanity and readability of the
>    code. Right now, PG_locked is super special and ad hoc and I have run into
>    situations multiple times (and Johannes was in vehement agreement on this
>    one) where I simply could not figure the behaviour of the current code re:
>    who is responsible for locking pages without instrumenting the code with
>    assertions.
> 
>    Meaning anything we do to create and enforce module boundaries between
>    different chunks of code is going to suck, but the end result should be
>    really worthwhile.
> 
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.

Also I want to kill PG_error.  PG_slab and PG_hwpoison become page
types (in the "allocatee" field in your parlance).  PG_head goes away
naturally.

Something we don't have to talk about right now is that there's no reason
several non-contiguous pages can't have the same 'allocatee' value.
I'm thinking that every page allocated to a given vmalloc allocation
would point to the same vm_struct.  So I'm thinking that the way all of
the above would work is we'd allocate a "struct folio" from slab, then
pass the (tagged) pointer to alloc_pages().  alloc_pages() would fill
in the 'allocatee' pointer for each of the struct pages with whatever
pointer it is given.
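
In other words, roughly this - alloc_pages_owner() and folio_cachep are made-up
names (and it reuses the hypothetical PAGE_OWNER_FOLIO tag from the sketch
upthread), just to show the shape of it:

static struct folio *folio_alloc_sketch(unsigned int order)
{
	struct folio *folio = kmem_cache_zalloc(folio_cachep, GFP_KERNEL);
	struct page *page;

	if (!folio)
		return NULL;

	/* every struct page in the allocation gets ->allocatee set to this */
	page = alloc_pages_owner(GFP_KERNEL, order,
				 (unsigned long)folio | PAGE_OWNER_FOLIO);
	if (!page) {
		kmem_cache_free(folio_cachep, folio);
		return NULL;
	}

	folio->pfn = page_to_pfn(page);
	return folio;
}

vmalloc would do the same thing, except with a pointer to its vm_struct instead
of a folio.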

There's probably a bunch of holes in the above handwaving, but I'm
pretty confident we can fill them.


* Re: Struct page proposal
  2021-09-23  3:23 ` Matthew Wilcox
@ 2021-09-23  5:15   ` Kent Overstreet
  2021-09-23 11:40     ` Mapcount of subpages Matthew Wilcox
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2021-09-23  5:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 22, 2021 at 09:21:31PM -0400, Kent Overstreet wrote:
> > The fundamental reason for struct page is that we need memory to be self
> > describing, without any context - we need to be able to go from a generic
> > untyped struct page and figure out what it contains: handling physical memory
> > failure is the most prominent example, but migration and compaction are more
> > common. We need to be able to ask the thing that owns a page of memory "hey,
> > stop using this and move your stuff here".
> 
> Yup, and another thing we need is to take any page mapped to userspace
> and mark it as dirty (whatever that means for the owner of the page).

Yeah so that leads into another discussion, which is that pages have a public
interface that can be called by outside code - via page->mapping->a_ops - but
it's not particularly clear - or documented anywhere that I've seen - what that
interface actually is. We have stuff in a_ops that is definitely _not_ part of the public
interface and is really more for internal filesystem use - write_begin,
write_end, .readpage - like half of a_ops, really. It would be nice if as part
of these cleanups we could separate out the actual public interface and nail it
down and document it better.

Most if not all of the stuff in a_ops that's for internal fs use really doesn't
need to be in an ops struct; it could be passed to the filemap.c functions that
use it - this would be a style improvement, since it's clearer when you pass a
function pointer directly to the function that's going to use it (e.g. passing
write_begin and write_end to generic_perform_write).
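
Something like this, to make the style concrete - the typedefs,
filemap_perform_write() and the myfs_* helpers are all invented here, and the
signatures are simplified relative to the real ->write_begin/->write_end hooks:

typedef int (*fs_write_begin_t)(struct file *file, loff_t pos, unsigned len,
				struct page **pagep, void **fsdata);
typedef int (*fs_write_end_t)(struct file *file, loff_t pos, unsigned len,
			      unsigned copied, struct page *page, void *fsdata);

ssize_t filemap_perform_write(struct kiocb *iocb, struct iov_iter *from,
			      fs_write_begin_t write_begin,
			      fs_write_end_t write_end);

/* a filesystem's ->write_iter() then calls it with its helpers visible at
 * the call site instead of buried in an ops struct: */
static ssize_t myfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	return filemap_perform_write(iocb, from,
				     myfs_write_begin, myfs_write_end);
}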

> We can also allocate a far larger structure.  eg, we might decide that
> a file page looks like this:

Yes! At the same time we should be trying to come up with cleanups that just
completely delete some of these things, but having the freedom to not care
about shaving every single byte, and laying things out in a sane way so we can
see at a glance what's there, is going to make those cleanups that much easier.

> (compiling that list reminds me that we'll need to sort out mapcount
> on subpages when it comes time to do this.  ask me if you don't know
> what i'm talking about here.)

I am curious why we would ever need a mapcount for just part of a page, tell me
more.

> I think we /can/ do all this.  I don't know that it's the right thing to
> do.  And I really mean that.  I genuinely don't know that "allocate
> file pages from slab" will solve any problems at all.  And I kind of
> don't want to investigate that until later.

The idea isn't "allocate file pages from slab", it's "do all allocations < 64k
(or whatever we decide) from slab" - because if you look at enough profiles
there are plenty of workloads where this really is a real-world issue, and if we
can rejigger things so that code outside mm/ literally does not care whether it
uses slab or alloc_pages(), why not do it? It's the cleanest approach.
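
As a sketch of what the boundary means (the threshold and the mem_alloc/mem_free
names are placeholders - conceptually it's just moving kmalloc's existing
large-allocation fallback point upward):

#define SLAB_ALLOC_MAX	(64 * 1024)

static inline void *mem_alloc(size_t size, gfp_t gfp)
{
	/* small and medium allocations stay in slab, big ones go to buddy */
	if (size <= SLAB_ALLOC_MAX)
		return kmalloc(size, gfp);
	return (void *)__get_free_pages(gfp, get_order(size));
}

static inline void mem_free(const void *p, size_t size)
{
	if (size <= SLAB_ALLOC_MAX)
		kfree(p);
	else
		free_pages((unsigned long)p, get_order(size));
}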

> By the way, another way we could do this is to put the 'allocator'
> field into the allocatee's data structure.  eg the first word
> in struct folio could point to the struct slab that contains it.
> 
> > Other notes & potential issues:
> >  - page->compound_dtor needs to die
> 
> The reason we have it right now is that the last person to call
> put_page() may not be the one who allocated it.  _maybe_ we can do
> all-of-the-dtor-stuff when the person who allocates it frees it, and
> have put_page() only free the memory.  TBD.

We have it because hugepages and transhuge pages are "special" and special in
different ways. They should probably just be bits in the internal allocator
state, and hopefully compound_page_dtor and transhuge_dtor can just be deleted
(the hugepage dtor actually does do real things involving the hugepage reserve).

> Something we don't have to talk about right now is that there's no reason
> several non-contiguous pages can't have the same 'allocatee' value.
> I'm thinking that every page allocated to a given vmalloc allocation
> would point to the same vm_struct.  So I'm thinking that the way all of
> the above would work is we'd allocate a "struct folio" from slab, then
> pass the (tagged) pointer to alloc_pages().  alloc_pages() would fill
> in the 'allocatee' pointer for each of the struct pages with whatever
> pointer it is given.

There's also vmap to think about. What restrictions, if any, are there on what
kinds of pages can be passed to vmap()? I see it used in a lot of driver code,
and I wonder if I even dare ask what for.

> There's probably a bunch of holes in the above handwaving, but I'm
> pretty confident we can fill them.

Fun times!


* Re: Struct page proposal
  2021-09-23  1:21 Struct page proposal Kent Overstreet
  2021-09-23  3:23 ` Matthew Wilcox
@ 2021-09-23  9:03 ` David Hildenbrand
  2021-09-23 15:22   ` Kent Overstreet
  2021-09-27 17:48 ` Vlastimil Babka
  2 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-09-23  9:03 UTC (permalink / raw)
  To: Kent Overstreet, linux-fsdevel, linux-kernel, linux-mm
  Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
> 
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
> 
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
> 
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
> 
> 
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
> 
> The main thing to note is that since in normal operation most folios are going
> to be describing many pages, not just one - and we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
> 
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
> 
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
> 
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
> 
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
> 
> So if we have this:
> 
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> 
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
> 
> The allocatee field would be used for a type tagged (using the low bits of the
> pointer) to one of:
>   - struct folio
>   - struct anon_folio, if that becomes a thing
>   - struct network_pool_page
>   - struct pte_page
>   - struct zone_device_page
> 
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
> 
> Other notes & potential issues:
>   - page->compound_dtor needs to die
> 
>   - page->rcu_head moves into the types that actually need it, no issues there
> 
>   - page->refcount has question marks around it. I think we can also just move it
>     into the types that need it; with RCU derefing the pointer to the folio or
>     whatever and grabing a ref on folio->refcount can happen under a RCU read
>     lock - there's no real question about whether it's technically possible to
>     get it out of struct page, and I think it would be cleaner overall that way.
> 
>     However, depending on how it's used from code paths that go from generic
>     untyped pages, I could see it turning into more of a hassle than it's worth.
>     More investigation is needed.
> 
>   - page->memcg_data - I don't know whether that one more properly belongs in
>     struct page or in the page subtypes - I'd love it if Johannes could talk
>     about that one.
> 
>   - page->flags - dealing with this is going to be a huge hassle but also where
>     we'll find some of the biggest gains in overall sanity and readability of the
>     code. Right now, PG_locked is super special and ad hoc and I have run into
>     situations multiple times (and Johannes was in vehement agreement on this
>     one) where I simply could not figure the behaviour of the current code re:
>     who is responsible for locking pages without instrumenting the code with
>     assertions.
> 
>     Meaning anything we do to create and enforce module boundaries between
>     different chunks of code is going to suck, but the end result should be
>     really worthwhile.
> 
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
> 

Don't get me wrong, but before there are answers to some of the very 
basic questions raised above (especially everything that lives in 
page->flags, which are not only page flags, refcount, ...) this isn't 
very tempting to spend more time on, from a reviewer perspective.

-- 
Thanks,

David / dhildenb



* Mapcount of subpages
  2021-09-23  5:15   ` Kent Overstreet
@ 2021-09-23 11:40     ` Matthew Wilcox
  2021-09-23 12:45       ` Kirill A. Shutemov
  2021-09-23 18:56       ` Mike Kravetz
  0 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-23 11:40 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Kirill A. Shutemov,
	Mike Kravetz

On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
> On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
> > (compiling that list reminds me that we'll need to sort out mapcount
> > on subpages when it comes time to do this.  ask me if you don't know
> > what i'm talking about here.)
> 
> I am curious why we would ever need a mapcount for just part of a page, tell me
> more.

I would say Kirill is the expert here.  My understanding:

We have three different approaches to allocating 2MB pages today;
anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
2MB boundary, so it has no special handling of mapcount [1].  Anon THP
always starts out as being mapped exclusively on a 2MB boundary, but
then it can be split by, eg, munmap().  If it is, then the mapcount in
the head page is distributed to the subpages.

Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
but then have processes which only ever map part of it.  Or you might
have some processes mapping it with a 2MB entry and others mapping part
or all of it with 4kB entries.  And then someone truncates the file to
midway through this page; we split it, and now we need to figure out what
the mapcount should be on each of the subpages.  We handle this by using
->mapcount on each subpage to record how many non-2MB mappings there are
of that specific page and using ->compound_mapcount to record how many 2MB
mappings there are of the entire 2MB page.  Then, when we split, we just
need to distribute the compound_mapcount to each page to make it correct.
We also have the PageDoubleMap flag to tell us whether anybody has this
2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
mapcounts if nobody has done that.

[1] Mike is looking to change this, but I'm not sure where he is with it.
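
As a sketch of the accounting (not the real split code, which also has to deal
with the -1 bias in _mapcount and the PageDoubleMap bookkeeping), "distribute
the compound_mapcount to each page" amounts to:

static void distribute_compound_mapcount(struct page *head)
{
	int i, pmd_mappings = compound_mapcount(head);

	for (i = 0; i < compound_nr(head); i++) {
		/*
		 * After the split, every former 2MB (PMD) mapping becomes a
		 * 4kB (PTE) mapping of each subpage, on top of whatever PTE
		 * mappings the subpage already counted in its own ->_mapcount.
		 */
		atomic_add(pmd_mappings, &(head + i)->_mapcount);
	}
}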


* Re: Mapcount of subpages
  2021-09-23 11:40     ` Mapcount of subpages Matthew Wilcox
@ 2021-09-23 12:45       ` Kirill A. Shutemov
  2021-09-23 21:10           ` Hugh Dickins
  2021-09-23 18:56       ` Mike Kravetz
  1 sibling, 1 reply; 43+ messages in thread
From: Kirill A. Shutemov @ 2021-09-23 12:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 12:40:14PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
> > On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
> > > (compiling that list reminds me that we'll need to sort out mapcount
> > > on subpages when it comes time to do this.  ask me if you don't know
> > > what i'm talking about here.)
> > 
> > I am curious why we would ever need a mapcount for just part of a page, tell me
> > more.
> 
> I would say Kirill is the expert here.  My understanding:
> 
> We have three different approaches to allocating 2MB pages today;
> anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
> 2MB boundary, so it has no special handling of mapcount [1].  Anon THP
> always starts out as being mapped exclusively on a 2MB boundary, but
> then it can be split by, eg, munmap().  If it is, then the mapcount in
> the head page is distributed to the subpages.

One more complication for anon THP is that it can be shared across fork()
and one process may split it while others have it mapped with a PMD.

> Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
> but then have processes which only ever map part of it.  Or you might
> have some processes mapping it with a 2MB entry and others mapping part
> or all of it with 4kB entries.  And then someone truncates the file to
> midway through this page; we split it, and now we need to figure out what
> the mapcount should be on each of the subpages.  We handle this by using
> ->mapcount on each subpage to record how many non-2MB mappings there are
> of that specific page and using ->compound_mapcount to record how many 2MB
> mappings there are of the entire 2MB page.  Then, when we split, we just
> need to distribute the compound_mapcount to each page to make it correct.
> We also have the PageDoubleMap flag to tell us whether anybody has this
> 2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
> mapcounts if nobody has done that.

Possible future complication comes from 1G THP effort. With 1G THP we
would have whole hierarchy of mapcounts: 1 PUD mapcount, 512 PMD
mapcounts and 262144 PTE mapcounts. (That's one of the reasons I don't
think 1G THP is viable.)

Note that there are places where exact mapcount accounting is critical:
try_to_unmap() may finish prematurely if we underestimate mapcount and
overestimating mapcount may lead to superfluous CoW that breaks GUP.

> 
> [1] Mike is looking to change this, but I'm not sure where he is with it.
> 

-- 
 Kirill A. Shutemov


* Re: Struct page proposal
  2021-09-23  9:03 ` Struct page proposal David Hildenbrand
@ 2021-09-23 15:22   ` Kent Overstreet
  2021-09-23 15:34     ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2021-09-23 15:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Matthew Wilcox, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Thu, Sep 23, 2021 at 11:03:44AM +0200, David Hildenbrand wrote:
> Don't get me wrong, but before there are answers to some of the very basic
> questions raised above (especially everything that lives in page->flags,
> which are not only page flags, refcount, ...) this isn't very tempting to
> spend more time on, from a reviewer perspective.

Did you miss the part of the folios discussion where we were talking about how
acrimonious it had gotten and why, and about the suggestion (from Chris Mason in
particular) to write design docs up front, and how they'd been pretty successful
in other places?

We're trying something new here, and trying to give people an opportunity to
discuss what we're trying to do _before_ dumping thousands and thousands of
lines of refactoring patches on the list.


* Re: Struct page proposal
  2021-09-23 15:22   ` Kent Overstreet
@ 2021-09-23 15:34     ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2021-09-23 15:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Matthew Wilcox, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On 23.09.21 17:22, Kent Overstreet wrote:
> On Thu, Sep 23, 2021 at 11:03:44AM +0200, David Hildenbrand wrote:
>> Don't get me wrong, but before there are answers to some of the
>> very basic questions raised above (especially everything that lives
>> in page->flags, which are not only page flags, refcount, ...) this
>> isn't very tempting to spend more time on, from a reviewer
>> perspective.
> 
> Did you miss the part of the folios discussion where we were talking
> about how acrimonious it had gotten and why, and talking about (Chris
> Mason in particular) writing design docs up front and how they'd been
> pretty successful in other places?
> 
> We're trying something new here, and trying to give people an
> opportunity to discussion what we're trying to do _before_ dumping
> thousands and thousands of lines of refactoring patches on the list.
> 

This here is different: the very basic questions haven't been solved.
Folios compiled. Folios worked. I stopped following the discussion at 
one point, though.

Again, don't get me wrong, but what I read in this mail was "I don't
know how to solve most of this but this is what we could do.".

Would we want to reduce the struct page size? Sure! Do we have a
concrete plan on how all the corner cases would work? No.

IIRC Windows uses exactly one pointer (8 bytes) to track the state of a 
physical page by linking it into the right list. So what would you say 
if I proposed that without tackling the hard cases?

Corner cases are what make it hard. Memory holes. Memory hot(un)plug.
Page isolation. Memory poisoning. Various memory allocators. Lock-free 
physical memory walkers. And that's all outside the scope of filesystems.

-- 
Thanks,

David / dhildenb



* Re: Mapcount of subpages
  2021-09-23 11:40     ` Mapcount of subpages Matthew Wilcox
  2021-09-23 12:45       ` Kirill A. Shutemov
@ 2021-09-23 18:56       ` Mike Kravetz
  1 sibling, 0 replies; 43+ messages in thread
From: Mike Kravetz @ 2021-09-23 18:56 UTC (permalink / raw)
  To: Matthew Wilcox, Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Kirill A. Shutemov

On 9/23/21 4:40 AM, Matthew Wilcox wrote:
> On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
>> On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
>>> (compiling that list reminds me that we'll need to sort out mapcount
>>> on subpages when it comes time to do this.  ask me if you don't know
>>> what i'm talking about here.)
>>
>> I am curious why we would ever need a mapcount for just part of a page, tell me
>> more.
> 
> I would say Kirill is the expert here.  My understanding:
> 
> We have three different approaches to allocating 2MB pages today;
> anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
> 2MB boundary, so it has no special handling of mapcount [1].  Anon THP
> always starts out as being mapped exclusively on a 2MB boundary, but
> then it can be split by, eg, munmap().  If it is, then the mapcount in
> the head page is distributed to the subpages.
> 
> Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
> but then have processes which only ever map part of it.  Or you might
> have some processes mapping it with a 2MB entry and others mapping part
> or all of it with 4kB entries.  And then someone truncates the file to
> midway through this page; we split it, and now we need to figure out what
> the mapcount should be on each of the subpages.  We handle this by using
> ->mapcount on each subpage to record how many non-2MB mappings there are
> of that specific page and using ->compound_mapcount to record how many 2MB
> mappings there are of the entire 2MB page.  Then, when we split, we just
> need to distribute the compound_mapcount to each page to make it correct.
> We also have the PageDoubleMap flag to tell us whether anybody has this
> 2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
> mapcounts if nobody has done that.
> 
> [1] Mike is looking to change this, but I'm not sure where he is with it.

Not just me; several others are also interested in 'double mapping'
hugetlb pages.  We are very early in the process and have not yet got
to this level of detail.  My initial thought is that this may not be an
issue.  Although we will allow different size mappings, we will never
split the underlying huge page.
-- 
Mike Kravetz


* Re: Mapcount of subpages
  2021-09-23 12:45       ` Kirill A. Shutemov
@ 2021-09-23 21:10           ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2021-09-23 21:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Matthew Wilcox, Kent Overstreet, linux-fsdevel, linux-kernel,
	linux-mm, Johannes Weiner, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells, Mike Kravetz

On Thu, 23 Sep 2021, Kirill A. Shutemov wrote:
> On Thu, Sep 23, 2021 at 12:40:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
> > > On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
> > > > (compiling that list reminds me that we'll need to sort out mapcount
> > > > on subpages when it comes time to do this.  ask me if you don't know
> > > > what i'm talking about here.)
> > > 
> > > I am curious why we would ever need a mapcount for just part of a page, tell me
> > > more.
> > 
> > I would say Kirill is the expert here.  My understanding:
> > 
> > We have three different approaches to allocating 2MB pages today;
> > anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
> > 2MB boundary, so it has no special handling of mapcount [1].  Anon THP
> > always starts out as being mapped exclusively on a 2MB boundary, but
> > then it can be split by, eg, munmap().  If it is, then the mapcount in
> > the head page is distributed to the subpages.
> 
> One more complication for anon THP is that it can be shared across fork()
> and one process may split it while other have it mapped with PMD.
> 
> > Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
> > but then have processes which only ever map part of it.  Or you might
> > have some processes mapping it with a 2MB entry and others mapping part
> > or all of it with 4kB entries.  And then someone truncates the file to
> > midway through this page; we split it, and now we need to figure out what
> > the mapcount should be on each of the subpages.  We handle this by using
> > ->mapcount on each subpage to record how many non-2MB mappings there are
> > of that specific page and using ->compound_mapcount to record how many 2MB
> > mappings there are of the entire 2MB page.  Then, when we split, we just
> > need to distribute the compound_mapcount to each page to make it correct.
> > We also have the PageDoubleMap flag to tell us whether anybody has this
> > 2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
> > mapcounts if nobody has done that.
> 
> Possible future complication comes from 1G THP effort. With 1G THP we
> would have whole hierarchy of mapcounts: 1 PUD mapcount, 512 PMD
> mapcounts and 262144 PTE mapcounts. (That's one of the reasons I don't
> think 1G THP is viable.)
> 
> Note that there are places where exact mapcount accounting is critical:
> try_to_unmap() may finish prematurely if we underestimate mapcount and
> overestimating mapcount may lead to superfluous CoW that breaks GUP.

It is critical to know for sure when a page has been completely unmapped:
but that does not need ptes of subpages to be accounted in the _mapcount
field of subpages - they just need to be counted in the compound page's
total_mapcount.

I may be wrong, I never had time to prove it one way or the other: but
I have a growing suspicion that the *only* reason for maintaining tail
_mapcounts separately, is to maintain the NR_FILE_MAPPED count exactly
(in the face of pmd mappings overlapping pte mappings).

NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
of other such stats files, and for a reclaim heuristic in mm/vmscan.c.

Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
each pte as if it mapped the whole THP, or don't count a THP's ptes
at all - you opted for the latter in the "Mlocked:" accounting),
and I suspect subpage _mapcount could be abandoned.

But you have a different point in mind when you refer to superfluous
CoW and GUP: I don't know the score there (and I think we are still in
that halfway zone, since pte CoW was changed to depend on page_count,
but THP CoW still depending on mapcount).

Hugh


* Re: Mapcount of subpages
  2021-09-23 21:10           ` Hugh Dickins
@ 2021-09-23 21:54             ` Yang Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-09-23 21:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Matthew Wilcox, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 23 Sep 2021, Kirill A. Shutemov wrote:
> > On Thu, Sep 23, 2021 at 12:40:14PM +0100, Matthew Wilcox wrote:
> > > On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
> > > > On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
> > > > > (compiling that list reminds me that we'll need to sort out mapcount
> > > > > on subpages when it comes time to do this.  ask me if you don't know
> > > > > what i'm talking about here.)
> > > >
> > > > I am curious why we would ever need a mapcount for just part of a page, tell me
> > > > more.
> > >
> > > I would say Kirill is the expert here.  My understanding:
> > >
> > > We have three different approaches to allocating 2MB pages today;
> > > anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
> > > 2MB boundary, so it has no special handling of mapcount [1].  Anon THP
> > > always starts out as being mapped exclusively on a 2MB boundary, but
> > > then it can be split by, eg, munmap().  If it is, then the mapcount in
> > > the head page is distributed to the subpages.
> >
> > One more complication for anon THP is that it can be shared across fork()
> > and one process may split it while other have it mapped with PMD.
> >
> > > Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
> > > but then have processes which only ever map part of it.  Or you might
> > > have some processes mapping it with a 2MB entry and others mapping part
> > > or all of it with 4kB entries.  And then someone truncates the file to
> > > midway through this page; we split it, and now we need to figure out what
> > > the mapcount should be on each of the subpages.  We handle this by using
> > > ->mapcount on each subpage to record how many non-2MB mappings there are
> > > of that specific page and using ->compound_mapcount to record how many 2MB
> > > mappings there are of the entire 2MB page.  Then, when we split, we just
> > > need to distribute the compound_mapcount to each page to make it correct.
> > > We also have the PageDoubleMap flag to tell us whether anybody has this
> > > 2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
> > > mapcounts if nobody has done that.
> >
> > Possible future complication comes from 1G THP effort. With 1G THP we
> > would have whole hierarchy of mapcounts: 1 PUD mapcount, 512 PMD
> > mapcounts and 262144 PTE mapcounts. (That's one of the reasons I don't
> > think 1G THP is viable.)
> >
> > Note that there are places where exact mapcount accounting is critical:
> > try_to_unmap() may finish prematurely if we underestimate mapcount and
> > overestimating mapcount may lead to superfluous CoW that breaks GUP.
>
> It is critical to know for sure when a page has been completely unmapped:
> but that does not need ptes of subpages to be accounted in the _mapcount
> field of subpages - they just need to be counted in the compound page's
> total_mapcount.
>
> I may be wrong, I never had time to prove it one way or the other: but
> I have a growing suspicion that the *only* reason for maintaining tail
> _mapcounts separately, is to maintain the NR_FILE_MAPPED count exactly
> (in the face of pmd mappings overlapping pte mappings).
>
> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
>
> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
> each pte as if it mapped the whole THP, or don't count a THP's ptes
> at all - you opted for the latter in the "Mlocked:" accounting),
> and I suspect subpage _mapcount could be abandoned.

AFAIK, partial THP unmap may need the _mapcount information of every
subpage; otherwise the deferred split can't know which subpages could be
freed.

>
> But you have a different point in mind when you refer to superfluous
> CoW and GUP: I don't know the score there (and I think we are still in
> that halfway zone, since pte CoW was changed to depend on page_count,
> but THP CoW still depending on mapcount).
>
> Hugh
>


* Re: Mapcount of subpages
  2021-09-23 21:54             ` Yang Shi
@ 2021-09-23 22:23             ` Zi Yan
  2021-09-23 23:48                 ` Hugh Dickins
  -1 siblings, 1 reply; 43+ messages in thread
From: Zi Yan @ 2021-09-23 22:23 UTC (permalink / raw)
  To: Yang Shi, Kirill A. Shutemov
  Cc: Hugh Dickins, Matthew Wilcox, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz


On 23 Sep 2021, at 17:54, Yang Shi wrote:

> On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
>>
>> On Thu, 23 Sep 2021, Kirill A. Shutemov wrote:
>>> On Thu, Sep 23, 2021 at 12:40:14PM +0100, Matthew Wilcox wrote:
>>>> On Thu, Sep 23, 2021 at 01:15:16AM -0400, Kent Overstreet wrote:
>>>>> On Thu, Sep 23, 2021 at 04:23:12AM +0100, Matthew Wilcox wrote:
>>>>>> (compiling that list reminds me that we'll need to sort out mapcount
>>>>>> on subpages when it comes time to do this.  ask me if you don't know
>>>>>> what i'm talking about here.)
>>>>>
>>>>> I am curious why we would ever need a mapcount for just part of a page, tell me
>>>>> more.
>>>>
>>>> I would say Kirill is the expert here.  My understanding:
>>>>
>>>> We have three different approaches to allocating 2MB pages today;
>>>> anon THP, shmem THP and hugetlbfs.  Hugetlbfs can only be mapped on a
>>>> 2MB boundary, so it has no special handling of mapcount [1].  Anon THP
>>>> always starts out as being mapped exclusively on a 2MB boundary, but
>>>> then it can be split by, eg, munmap().  If it is, then the mapcount in
>>>> the head page is distributed to the subpages.
>>>
>>> One more complication for anon THP is that it can be shared across fork()
>>> and one process may split it while other have it mapped with PMD.
>>>
>>>> Shmem THP is the tricky one.  You might have a 2MB page in the page cache,
>>>> but then have processes which only ever map part of it.  Or you might
>>>> have some processes mapping it with a 2MB entry and others mapping part
>>>> or all of it with 4kB entries.  And then someone truncates the file to
>>>> midway through this page; we split it, and now we need to figure out what
>>>> the mapcount should be on each of the subpages.  We handle this by using
>>>> ->mapcount on each subpage to record how many non-2MB mappings there are
>>>> of that specific page and using ->compound_mapcount to record how many 2MB
>>>> mappings there are of the entire 2MB page.  Then, when we split, we just
>>>> need to distribute the compound_mapcount to each page to make it correct.
>>>> We also have the PageDoubleMap flag to tell us whether anybody has this
>>>> 2MB page mapped with 4kB entries, so we can skip all the summing of 4kB
>>>> mapcounts if nobody has done that.
>>>
>>> Possible future complication comes from 1G THP effort. With 1G THP we
>>> would have whole hierarchy of mapcounts: 1 PUD mapcount, 512 PMD
>>> mapcounts and 262144 PTE mapcounts. (That's one of the reasons I don't
>>> think 1G THP is viable.)

Maybe we do not need to support the triple map. Instead, only allow PUD and PMD
mappings and split a 1GB THP into 2MB THPs before a PTE mapping is established.
How likely is it that a 1GB THP would be mapped by both PUD and PTE entries? I
guess it would be very rare.

>>>
>>> Note that there are places where exact mapcount accounting is critical:
>>> try_to_unmap() may finish prematurely if we underestimate mapcount and
>>> overestimating mapcount may lead to superfluous CoW that breaks GUP.
>>
>> It is critical to know for sure when a page has been completely unmapped:
>> but that does not need ptes of subpages to be accounted in the _mapcount
>> field of subpages - they just need to be counted in the compound page's
>> total_mapcount.
>>
>> I may be wrong, I never had time to prove it one way or the other: but
>> I have a growing suspicion that the *only* reason for maintaining tail
>> _mapcounts separately, is to maintain the NR_FILE_MAPPED count exactly
>> (in the face of pmd mappings overlapping pte mappings).
>>
>> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
>> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
>>
>> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
>> each pte as if it mapped the whole THP, or don't count a THP's ptes
>> at all - you opted for the latter in the "Mlocked:" accounting),
>> and I suspect subpage _mapcount could be abandoned.
>
> AFAIK, partial THP unmap may need the _mapcount information of every
> subpage otherwise the deferred split can't know what subpages could be
> freed.

Could we just scan page tables of a THP during deferred split process
instead? Deferred split is a slow path already, so maybe it can afford
the extra work.

>
>>
>> But you have a different point in mind when you refer to superfluous
>> CoW and GUP: I don't know the score there (and I think we are still in
>> that halfway zone, since pte CoW was changed to depend on page_count,
>> but THP CoW still depending on mapcount).
>>
>> Hugh
>>


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-23 22:23             ` Zi Yan
@ 2021-09-23 23:48                 ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2021-09-23 23:48 UTC (permalink / raw)
  To: Zi Yan
  Cc: Yang Shi, Kirill A. Shutemov, Hugh Dickins, Matthew Wilcox,
	Kent Overstreet, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, Linux MM, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, 23 Sep 2021, Zi Yan wrote:
> On 23 Sep 2021, at 17:54, Yang Shi wrote:
> > On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
> >>
> >> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
> >> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
> >>
> >> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
> >> each pte as if it mapped the whole THP, or don't count a THP's ptes
> >> at all - you opted for the latter in the "Mlocked:" accounting),
> >> and I suspect subpage _mapcount could be abandoned.
> >
> > AFAIK, partial THP unmap may need the _mapcount information of every
> > subpage otherwise the deferred split can't know what subpages could be
> > freed.

I believe Yang Shi is right insofar as the decision on whether it's worth
queuing for deferred split is being done based on those subpage _mapcounts.
That is a use I had not considered, and I've given no thought to how
important or not it is.

> 
> Could we just scan page tables of a THP during deferred split process
> instead? Deferred split is a slow path already, so maybe it can afford
> the extra work.

But unless I misunderstand, actually carrying out the deferred split
already unmaps, uses migration entries, and remaps the remaining ptes:
needing no help from subpage _mapcounts to do those, and free the rest.

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-23 23:48                 ` Hugh Dickins
  (?)
@ 2021-09-24  0:25                 ` Zi Yan
  2021-09-24  0:57                     ` Hugh Dickins
  -1 siblings, 1 reply; 43+ messages in thread
From: Zi Yan @ 2021-09-24  0:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Yang Shi, Kirill A. Shutemov, Matthew Wilcox, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On 23 Sep 2021, at 19:48, Hugh Dickins wrote:

> On Thu, 23 Sep 2021, Zi Yan wrote:
>> On 23 Sep 2021, at 17:54, Yang Shi wrote:
>>> On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
>>>>
>>>> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
>>>> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
>>>>
>>>> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
>>>> each pte as if it mapped the whole THP, or don't count a THP's ptes
>>>> at all - you opted for the latter in the "Mlocked:" accounting),
>>>> and I suspect subpage _mapcount could be abandoned.
>>>
>>> AFAIK, partial THP unmap may need the _mapcount information of every
>>> subpage otherwise the deferred split can't know what subpages could be
>>> freed.
>
> I believe Yang Shi is right insofar as the decision on whether it's worth
> queuing for deferred split is being done based on those subpage _mapcounts.
> That is a use I had not considered, and I've given no thought to how
> important or not it is.
>
>>
>> Could we just scan page tables of a THP during deferred split process
>> instead? Deferred split is a slow path already, so maybe it can afford
>> the extra work.
>
> But unless I misunderstand, actually carrying out the deferred split
> already unmaps, uses migration entries, and remaps the remaining ptes:
> needing no help from subpage _mapcounts to do those, and free the rest.

You are right. unmap_page() during THP split is scanning the page tables
already.

For deciding whether to queue a THP for deferred split, we probably can
keep PageDoubleMap bit to indicate if any subpage is PTE mapped.

But without subpage _mapcount, detecting extra pins on a THP before a split
might not be as easy as it is now. This means every THP split will need to
perform unmap_page() and then check the remaining page_count to see if the
split is possible. That would also introduce extra system-wide overhead
from unmapping pages. Am I missing anything?
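
To make the check concrete, a toy sketch with made-up field names (this is
not the kernel's split path): the question is whether page_count() exceeds
the references we can account for, and the mapping part of that accounting
is what per-subpage _mapcount makes cheap to get at up front.

#include <stdbool.h>
#include <stdio.h>

/* Toy model with hypothetical names - not the kernel's structures. */
struct toy_page {
	int refcount;		/* stands in for page_count() */
	int mappings;		/* references taken by page tables */
	int pagecache_refs;	/* references held by the page cache */
};

/* References we can explain; anything beyond this is an "extra pin". */
static int expected_refs(const struct toy_page *p)
{
	return p->mappings + p->pagecache_refs + 1;	/* +1 for the caller */
}

static bool has_extra_pins(const struct toy_page *p)
{
	return p->refcount > expected_refs(p);
}

int main(void)
{
	/* 4 refs: 2 mappings + 1 page cache + 1 caller, so no extra pin. */
	struct toy_page p = { .refcount = 4, .mappings = 2, .pagecache_refs = 1 };

	printf("extra pins: %s\n", has_extra_pins(&p) ? "yes" : "no");
	return 0;
}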


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-24  0:25                 ` Zi Yan
@ 2021-09-24  0:57                     ` Hugh Dickins
  0 siblings, 0 replies; 43+ messages in thread
From: Hugh Dickins @ 2021-09-24  0:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: Hugh Dickins, Yang Shi, Kirill A. Shutemov, Matthew Wilcox,
	Kent Overstreet, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, Linux MM, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, 23 Sep 2021, Zi Yan wrote:
> On 23 Sep 2021, at 19:48, Hugh Dickins wrote:
> > On Thu, 23 Sep 2021, Zi Yan wrote:
> >> On 23 Sep 2021, at 17:54, Yang Shi wrote:
> >>> On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
> >>>>
> >>>> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
> >>>> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
> >>>>
> >>>> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
> >>>> each pte as if it mapped the whole THP, or don't count a THP's ptes
> >>>> at all - you opted for the latter in the "Mlocked:" accounting),
> >>>> and I suspect subpage _mapcount could be abandoned.
> >>>
> >>> AFAIK, partial THP unmap may need the _mapcount information of every
> >>> subpage otherwise the deferred split can't know what subpages could be
> >>> freed.
> >
> > I believe Yang Shi is right insofar as the decision on whether it's worth
> > queuing for deferred split is being done based on those subpage _mapcounts.
> > That is a use I had not considered, and I've given no thought to how
> > important or not it is.
> >
> >>
> >> Could we just scan page tables of a THP during deferred split process
> >> instead? Deferred split is a slow path already, so maybe it can afford
> >> the extra work.
> >
> > But unless I misunderstand, actually carrying out the deferred split
> > already unmaps, uses migration entries, and remaps the remaining ptes:
> > needing no help from subpage _mapcounts to do those, and free the rest.
> 
> You are right. unmap_page() during THP split is scanning the page tables
> already.
> 
> For deciding whether to queue a THP for deferred split, we probably can
> keep PageDoubleMap bit to indicate if any subpage is PTE mapped.

Maybe, maybe not.

> 
> But without subpage _mapcount, detecting extra pins on a THP before a split
> might not be as easy as it is now. This means every THP split will need to
> perform unmap_page() and then check the remaining page_count to see if the
> split is possible. That would also introduce extra system-wide overhead
> from unmapping pages. Am I missing anything?

I did not explain clearly enough: a subpage's ptes must still be counted
in total_mapcount(); but I'm suggesting that perhaps they can be counted
all together (either in the head page's _mapcount, or in a separate field
if that works better), instead of being distributed amongst the separate
subpages' _mapcounts.

And this would lower the system-wide overheads inside total_mapcount()
and page_mapped() (and maybe others).
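
A simplified sketch of the two accounting schemes being contrasted here
(illustrative only: the field name pte_mapcount is made up, and the real
total_mapcount() has more cases - offset-by-one counters, PageDoubleMap,
hugetlb - than this shows):

#include <stdio.h>

#define SUBPAGES 512

/* Today (simplified): PMD mappings counted once for the whole THP,
 * PTE mappings distributed across the subpages' _mapcounts. */
struct thp_today {
	int compound_mapcount;
	int subpage_mapcount[SUBPAGES];
};

static int total_mapcount_today(const struct thp_today *p)
{
	int i, total = p->compound_mapcount;

	for (i = 0; i < SUBPAGES; i++)	/* O(SUBPAGES) walk on every call */
		total += p->subpage_mapcount[i];
	return total;
}

/* The suggestion: keep all PTE mappings in one aggregate counter in the
 * head (or a separate field) instead of spreading them over subpages. */
struct thp_aggregated {
	int compound_mapcount;
	int pte_mapcount;
};

static int total_mapcount_aggregated(const struct thp_aggregated *p)
{
	return p->compound_mapcount + p->pte_mapcount;	/* O(1) */
}

int main(void)
{
	struct thp_today a = { .compound_mapcount = 1 };
	struct thp_aggregated b = { .compound_mapcount = 1, .pte_mapcount = 3 };

	a.subpage_mapcount[0] = 3;	/* three ptes map subpage 0 */

	printf("today: %d, aggregated: %d\n",
	       total_mapcount_today(&a), total_mapcount_aggregated(&b));
	return 0;
}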

Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-23 23:48                 ` Hugh Dickins
@ 2021-09-24  1:11                   ` Yang Shi
  -1 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-09-24  1:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Zi Yan, Kirill A. Shutemov, Matthew Wilcox, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 4:49 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 23 Sep 2021, Zi Yan wrote:
> > On 23 Sep 2021, at 17:54, Yang Shi wrote:
> > > On Thu, Sep 23, 2021 at 2:10 PM Hugh Dickins <hughd@google.com> wrote:
> > >>
> > >> NR_FILE_MAPPED being used for /proc/meminfo's "Mapped:" and a couple
> > >> of other such stats files, and for a reclaim heuristic in mm/vmscan.c.
> > >>
> > >> Allow ourselves more slack in NR_FILE_MAPPED accounting (either count
> > >> each pte as if it mapped the whole THP, or don't count a THP's ptes
> > >> at all - you opted for the latter in the "Mlocked:" accounting),
> > >> and I suspect subpage _mapcount could be abandoned.
> > >
> > > AFAIK, partial THP unmap may need the _mapcount information of every
> > > subpage otherwise the deferred split can't know what subpages could be
> > > freed.
>
> I believe Yang Shi is right insofar as the decision on whether it's worth
> queuing for deferred split is being done based on those subpage _mapcounts.
> That is a use I had not considered, and I've given no thought to how
> important or not it is.

Anyway deferred split is anon THP specific. We don't have to worry
about this for file THP. So your suggestion about just counting total
mapcount seems feasible to me for file THP at least.

>
> >
> > Could we just scan page tables of a THP during deferred split process
> > instead? Deferred split is a slow path already, so maybe it can afford
> > the extra work.
>
> But unless I misunderstand, actually carrying out the deferred split
> already unmaps, uses migration entries, and remaps the remaining ptes:
> needing no help from subpage _mapcounts to do those, and free the rest.
>
> Hugh

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-24  1:11                   ` Yang Shi
  (?)
@ 2021-09-24  1:31                   ` Matthew Wilcox
  2021-09-24  3:26                       ` Yang Shi
  -1 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-24  1:31 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Zi Yan, Kirill A. Shutemov, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 06:11:19PM -0700, Yang Shi wrote:
> On Thu, Sep 23, 2021 at 4:49 PM Hugh Dickins <hughd@google.com> wrote:
> > I believe Yang Shi is right insofar as the decision on whether it's worth
> > queuing for deferred split is being done based on those subpage _mapcounts.
> > That is a use I had not considered, and I've given no thought to how
> > important or not it is.
> 
> Anyway deferred split is anon THP specific. We don't have to worry
> about this for file THP. So your suggestion about just counting total
> mapcount seems feasible to me for file THP at least.

But I think we probably *should* do deferred split for file THP.
At the moment, when we truncate to the middle of a shmem THP, we try
a few times to split it and then just give up.  We should probably try
once and then queue it for deferred split.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-24  1:31                   ` Matthew Wilcox
@ 2021-09-24  3:26                       ` Yang Shi
  0 siblings, 0 replies; 43+ messages in thread
From: Yang Shi @ 2021-09-24  3:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Zi Yan, Kirill A. Shutemov, Kent Overstreet,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux MM,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 6:32 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Sep 23, 2021 at 06:11:19PM -0700, Yang Shi wrote:
> > On Thu, Sep 23, 2021 at 4:49 PM Hugh Dickins <hughd@google.com> wrote:
> > > I believe Yang Shi is right insofar as the decision on whether it's worth
> > > queuing for deferred split is being done based on those subpage _mapcounts.
> > > That is a use I had not considered, and I've given no thought to how
> > > important or not it is.
> >
> > Anyway deferred split is anon THP specific. We don't have to worry
> > about this for file THP. So your suggestion about just counting total
> > mapcount seems feasible to me for file THP at least.
>
> But I think we probably *should* do deferred split for file THP.
> At the moment, when we truncate to the middle of a shmem THP, we try
> a few times to split it and then just give up.  We should probably try
> once and then queue it for deferred split.

Yes, probably. Anyway this doesn't need _mapcount of subpages.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Mapcount of subpages
  2021-09-23 21:10           ` Hugh Dickins
  (?)
  (?)
@ 2021-09-24 23:05           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 43+ messages in thread
From: Kirill A. Shutemov @ 2021-09-24 23:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Matthew Wilcox, Kent Overstreet, linux-fsdevel, linux-kernel,
	linux-mm, Johannes Weiner, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells, Mike Kravetz

On Thu, Sep 23, 2021 at 02:10:13PM -0700, Hugh Dickins wrote:
> But you have a different point in mind when you refer to superfluous
> CoW and GUP: I don't know the score there (and I think we are still in
> that halfway zone, since pte CoW was changed to depend on page_count,
> but THP CoW still depending on mapcount).

I didn't pay enough attention to the topic when the change to depend on
page_count was made. I need to catch up.

I looked at the direction Andrea went in his patchset and so far I *feel*
he has a point[1]. I have not read the whole thing yet and I don't have a
firm position here, but maybe we need to get to the bottom of this topic
before considering ditching the per-subpage mapcount.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/tree/mm/memory.c?h=mapcount_deshare&id=e1cb3108d4131c2a7da03fbd37c3230cf082bfd9#n3153

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-23  1:21 Struct page proposal Kent Overstreet
  2021-09-23  3:23 ` Matthew Wilcox
  2021-09-23  9:03 ` Struct page proposal David Hildenbrand
@ 2021-09-27 17:48 ` Vlastimil Babka
  2021-09-27 17:53   ` Kent Overstreet
  2021-09-27 18:05   ` Matthew Wilcox
  2 siblings, 2 replies; 43+ messages in thread
From: Vlastimil Babka @ 2021-09-27 17:48 UTC (permalink / raw)
  To: Kent Overstreet, linux-fsdevel, linux-kernel, linux-mm
  Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

On 9/23/21 03:21, Kent Overstreet wrote:
> So if we have this:
> 
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> 
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.

The "free page in buddy allocator" case will be interesting to implement.
What the buddy allocator uses today is:

- PageBuddy - determine if page is free; a page_type (part of mapcount
field) today, could be a bit in "allocator" field that would have to be 0 in
all other "page is allocated" contexts.
- nid/zid - to prevent merging accross node/zone boundaries, now part of
page flags
- buddy order
- a list_head (reusing the "lru") to hold the struct page on the appropriate
free list, which has to be double-linked so page can be taken from the
middle of the list instantly

Won't be easy to cram all that into two unsigned long's, or even a single
one. We should avoid storing anything in the free page itself. Allocating
some external structures to track free pages is going to have funny
bootstrap problems. Probably a major redesign would be needed...
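
For a rough sense of the mismatch, here is a toy mirror of the state listed
above (illustrative only, not the kernel's definitions): the free-list
linkage alone is two pointers, before counting order, flags and the
PageBuddy marker.

#include <stdio.h>

/* Toy mirror of the per-free-page state listed above. */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

struct toy_free_page_state {
	struct toy_list_head lru;	/* free-list linkage (page->lru today) */
	unsigned long order;		/* buddy order (kept in page->private today) */
	unsigned long flags;		/* nid/zid live in page->flags today */
	/* ...plus the PageBuddy page type, kept in the mapcount word. */
};

int main(void)
{
	printf("free-page state: ~%zu bytes; two unsigned longs: %zu bytes\n",
	       sizeof(struct toy_free_page_state), 2 * sizeof(unsigned long));
	return 0;
}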


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 17:48 ` Vlastimil Babka
@ 2021-09-27 17:53   ` Kent Overstreet
  2021-09-27 18:34       ` Linus Torvalds
  2021-09-27 18:05   ` Matthew Wilcox
  1 sibling, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2021-09-27 17:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Matthew Wilcox, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> On 9/23/21 03:21, Kent Overstreet wrote:
> > So if we have this:
> > 
> > struct page {
> > 	unsigned long	allocator;
> > 	unsigned long	allocatee;
> > };
> > 
> > The allocator field would be used for either a pointer to slab/slub's state, if
> > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > allocation - like compound order today, and probably whether or not the
> > (compound group of) pages is free.
> 
> The "free page in buddy allocator" case will be interesting to implement.
> What the buddy allocator uses today is:
> 
> - PageBuddy - determine if page is free; a page_type (part of mapcount
> field) today, could be a bit in "allocator" field that would have to be 0 in
> all other "page is allocated" contexts.
> - nid/zid - to prevent merging accross node/zone boundaries, now part of
> page flags
> - buddy order
> - a list_head (reusing the "lru") to hold the struct page on the appropriate
> free list, which has to be double-linked so page can be taken from the
> middle of the list instantly

That list_head is the problematic one. Why do we need to be able to take a page
from the middle of a freelist?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 17:48 ` Vlastimil Babka
  2021-09-27 17:53   ` Kent Overstreet
@ 2021-09-27 18:05   ` Matthew Wilcox
  2021-09-27 18:09     ` Kent Overstreet
  2021-09-27 18:33     ` Kirill A. Shutemov
  1 sibling, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-27 18:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kent Overstreet, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> On 9/23/21 03:21, Kent Overstreet wrote:
> > So if we have this:
> > 
> > struct page {
> > 	unsigned long	allocator;
> > 	unsigned long	allocatee;
> > };
> > 
> > The allocator field would be used for either a pointer to slab/slub's state, if
> > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > allocation - like compound order today, and probably whether or not the
> > (compound group of) pages is free.
> 
> The "free page in buddy allocator" case will be interesting to implement.
> What the buddy allocator uses today is:
> 
> - PageBuddy - determine if page is free; a page_type (part of mapcount
> field) today, could be a bit in "allocator" field that would have to be 0 in
> all other "page is allocated" contexts.
> - nid/zid - to prevent merging accross node/zone boundaries, now part of
> page flags
> - buddy order
> - a list_head (reusing the "lru") to hold the struct page on the appropriate
> free list, which has to be double-linked so page can be taken from the
> middle of the list instantly
> 
> Won't be easy to cram all that into two unsigned long's, or even a single
> one. We should avoid storing anything in the free page itself. Allocating
> some external structures to track free pages is going to have funny
> bootstrap problems. Probably a major redesign would be needed...

Wait, why do we want to avoid using the memory that we're allocating?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:05   ` Matthew Wilcox
@ 2021-09-27 18:09     ` Kent Overstreet
  2021-09-27 18:12       ` Matthew Wilcox
  2021-09-27 19:07       ` Vlastimil Babka
  2021-09-27 18:33     ` Kirill A. Shutemov
  1 sibling, 2 replies; 43+ messages in thread
From: Kent Overstreet @ 2021-09-27 18:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> > On 9/23/21 03:21, Kent Overstreet wrote:
> > > So if we have this:
> > > 
> > > struct page {
> > > 	unsigned long	allocator;
> > > 	unsigned long	allocatee;
> > > };
> > > 
> > > The allocator field would be used for either a pointer to slab/slub's state, if
> > > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > > allocation - like compound order today, and probably whether or not the
> > > (compound group of) pages is free.
> > 
> > The "free page in buddy allocator" case will be interesting to implement.
> > What the buddy allocator uses today is:
> > 
> > - PageBuddy - determine if page is free; a page_type (part of mapcount
> > field) today, could be a bit in "allocator" field that would have to be 0 in
> > all other "page is allocated" contexts.
> > - nid/zid - to prevent merging accross node/zone boundaries, now part of
> > page flags
> > - buddy order
> > - a list_head (reusing the "lru") to hold the struct page on the appropriate
> > free list, which has to be double-linked so page can be taken from the
> > middle of the list instantly
> > 
> > Won't be easy to cram all that into two unsigned long's, or even a single
> > one. We should avoid storing anything in the free page itself. Allocating
> > some external structures to track free pages is going to have funny
> > bootstrap problems. Probably a major redesign would be needed...
> 
> Wait, why do we want to avoid using the memory that we're allocating?

The issue is where to stick the state for free pages. If that doesn't fit in two
ulongs, then we'd need a separate allocation, which means slab needs to be up
and running before free pages are initialized.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:09     ` Kent Overstreet
@ 2021-09-27 18:12       ` Matthew Wilcox
  2021-09-27 18:16         ` David Hildenbrand
  2021-09-27 18:16         ` Kent Overstreet
  2021-09-27 19:07       ` Vlastimil Babka
  1 sibling, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-27 18:12 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 02:09:49PM -0400, Kent Overstreet wrote:
> On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> > > On 9/23/21 03:21, Kent Overstreet wrote:
> > > > So if we have this:
> > > > 
> > > > struct page {
> > > > 	unsigned long	allocator;
> > > > 	unsigned long	allocatee;
> > > > };
> > > > 
> > > > The allocator field would be used for either a pointer to slab/slub's state, if
> > > > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > > > allocation - like compound order today, and probably whether or not the
> > > > (compound group of) pages is free.
> > > 
> > > The "free page in buddy allocator" case will be interesting to implement.
> > > What the buddy allocator uses today is:
> > > 
> > > - PageBuddy - determine if page is free; a page_type (part of mapcount
> > > field) today, could be a bit in "allocator" field that would have to be 0 in
> > > all other "page is allocated" contexts.
> > > - nid/zid - to prevent merging accross node/zone boundaries, now part of
> > > page flags
> > > - buddy order
> > > - a list_head (reusing the "lru") to hold the struct page on the appropriate
> > > free list, which has to be double-linked so page can be taken from the
> > > middle of the list instantly
> > > 
> > > Won't be easy to cram all that into two unsigned long's, or even a single
> > > one. We should avoid storing anything in the free page itself. Allocating
> > > some external structures to track free pages is going to have funny
> > > bootstrap problems. Probably a major redesign would be needed...
> > 
> > Wait, why do we want to avoid using the memory that we're allocating?
> 
> The issue is where to stick the state for free pages. If that doesn't fit in two
> ulongs, then we'd need a separate allocation, which means slab needs to be up
> and running before free pages are initialized.

But the thing we're allocating is at least PAGE_SIZE bytes in size.
Why is "We should avoid storing anything in the free page itself" true?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:12       ` Matthew Wilcox
@ 2021-09-27 18:16         ` David Hildenbrand
  2021-09-27 18:53           ` Vlastimil Babka
  2021-09-27 18:16         ` Kent Overstreet
  1 sibling, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-09-27 18:16 UTC (permalink / raw)
  To: Matthew Wilcox, Kent Overstreet
  Cc: Vlastimil Babka, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On 27.09.21 20:12, Matthew Wilcox wrote:
> On Mon, Sep 27, 2021 at 02:09:49PM -0400, Kent Overstreet wrote:
>> On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
>>> On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
>>>> On 9/23/21 03:21, Kent Overstreet wrote:
>>>>> So if we have this:
>>>>>
>>>>> struct page {
>>>>> 	unsigned long	allocator;
>>>>> 	unsigned long	allocatee;
>>>>> };
>>>>>
>>>>> The allocator field would be used for either a pointer to slab/slub's state, if
>>>>> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
>>>>> allocation - like compound order today, and probably whether or not the
>>>>> (compound group of) pages is free.
>>>>
>>>> The "free page in buddy allocator" case will be interesting to implement.
>>>> What the buddy allocator uses today is:
>>>>
>>>> - PageBuddy - determine if page is free; a page_type (part of mapcount
>>>> field) today, could be a bit in "allocator" field that would have to be 0 in
>>>> all other "page is allocated" contexts.
>>>> - nid/zid - to prevent merging accross node/zone boundaries, now part of
>>>> page flags
>>>> - buddy order
>>>> - a list_head (reusing the "lru") to hold the struct page on the appropriate
>>>> free list, which has to be double-linked so page can be taken from the
>>>> middle of the list instantly
>>>>
>>>> Won't be easy to cram all that into two unsigned long's, or even a single
>>>> one. We should avoid storing anything in the free page itself. Allocating
>>>> some external structures to track free pages is going to have funny
>>>> bootstrap problems. Probably a major redesign would be needed...
>>>
>>> Wait, why do we want to avoid using the memory that we're allocating?
>>
>> The issue is where to stick the state for free pages. If that doesn't fit in two
>> ulongs, then we'd need a separate allocation, which means slab needs to be up
>> and running before free pages are initialized.
> 
> But the thing we're allocating is at least PAGE_SIZE bytes in size.
> Why is "We should avoid storing anything in the free page itself" true?
> 

Immediately comes to mind:
* Free page reporting via virtio-balloon
* CMA on s390x (see arch_free_page())
* Free page poisoning
* init_on_free

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:12       ` Matthew Wilcox
  2021-09-27 18:16         ` David Hildenbrand
@ 2021-09-27 18:16         ` Kent Overstreet
  2021-09-28  3:19           ` Matthew Wilcox
  1 sibling, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2021-09-27 18:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 07:12:19PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 27, 2021 at 02:09:49PM -0400, Kent Overstreet wrote:
> > On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> > > > On 9/23/21 03:21, Kent Overstreet wrote:
> > > > > So if we have this:
> > > > > 
> > > > > struct page {
> > > > > 	unsigned long	allocator;
> > > > > 	unsigned long	allocatee;
> > > > > };
> > > > > 
> > > > > The allocator field would be used for either a pointer to slab/slub's state, if
> > > > > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > > > > allocation - like compound order today, and probably whether or not the
> > > > > (compound group of) pages is free.
> > > > 
> > > > The "free page in buddy allocator" case will be interesting to implement.
> > > > What the buddy allocator uses today is:
> > > > 
> > > > - PageBuddy - determine if page is free; a page_type (part of mapcount
> > > > field) today, could be a bit in "allocator" field that would have to be 0 in
> > > > all other "page is allocated" contexts.
> > > > - nid/zid - to prevent merging accross node/zone boundaries, now part of
> > > > page flags
> > > > - buddy order
> > > > - a list_head (reusing the "lru") to hold the struct page on the appropriate
> > > > free list, which has to be double-linked so page can be taken from the
> > > > middle of the list instantly
> > > > 
> > > > Won't be easy to cram all that into two unsigned long's, or even a single
> > > > one. We should avoid storing anything in the free page itself. Allocating
> > > > some external structures to track free pages is going to have funny
> > > > bootstrap problems. Probably a major redesign would be needed...
> > > 
> > > Wait, why do we want to avoid using the memory that we're allocating?
> > 
> > The issue is where to stick the state for free pages. If that doesn't fit in two
> > ulongs, then we'd need a separate allocation, which means slab needs to be up
> > and running before free pages are initialized.
> 
> But the thing we're allocating is at least PAGE_SIZE bytes in size.
> Why is "We should avoid storing anything in the free page itself" true?

Good point!

Highmem and dax do complicate things though - would they make it too much of a
hassle? You want to get rid of struct page for dax (what's the right term for
that kind of memory?), but we're not there yet, right?

I'm very curious why we need to be able to pull pages off the middle of a
freelist. If we can make do with singly linked freelists, then I think two
ulongs would be sufficient.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:05   ` Matthew Wilcox
  2021-09-27 18:09     ` Kent Overstreet
@ 2021-09-27 18:33     ` Kirill A. Shutemov
  1 sibling, 0 replies; 43+ messages in thread
From: Kirill A. Shutemov @ 2021-09-27 18:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, Kent Overstreet, linux-fsdevel, linux-kernel,
	linux-mm, Johannes Weiner, Linus Torvalds, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> > On 9/23/21 03:21, Kent Overstreet wrote:
> > > So if we have this:
> > > 
> > > struct page {
> > > 	unsigned long	allocator;
> > > 	unsigned long	allocatee;
> > > };
> > > 
> > > The allocator field would be used for either a pointer to slab/slub's state, if
> > > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > > allocation - like compound order today, and probably whether or not the
> > > (compound group of) pages is free.
> > 
> > The "free page in buddy allocator" case will be interesting to implement.
> > What the buddy allocator uses today is:
> > 
> > - PageBuddy - determine if page is free; a page_type (part of mapcount
> > field) today, could be a bit in "allocator" field that would have to be 0 in
> > all other "page is allocated" contexts.
> > - nid/zid - to prevent merging accross node/zone boundaries, now part of
> > page flags
> > - buddy order
> > - a list_head (reusing the "lru") to hold the struct page on the appropriate
> > free list, which has to be double-linked so page can be taken from the
> > middle of the list instantly
> > 
> > Won't be easy to cram all that into two unsigned long's, or even a single
> > one. We should avoid storing anything in the free page itself. Allocating
> > some external structures to track free pages is going to have funny
> > bootstrap problems. Probably a major redesign would be needed...
> 
> Wait, why do we want to avoid using the memory that we're allocating?

Intel TDX and AMD SEV have a concept of unaccepted memory. You cannot use
the memory until it has been "accepted". The acceptance is costly, so I made a
patchset[1] to postpone the acceptance until the first allocation. So the pages
are on the free list, but the page type indicates that they have to go through
an additional step on allocation.

[1] https://lore.kernel.org/all/20210810062626.1012-1-kirill.shutemov@linux.intel.com/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 17:53   ` Kent Overstreet
@ 2021-09-27 18:34       ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2021-09-27 18:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, linux-fsdevel, Linux Kernel Mailing List,
	Linux-MM, Johannes Weiner, Matthew Wilcox, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 10:54 AM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> That list_head is the problematic one. Why do we need to be able to take a page
> from the middle of a freelist?

At least for the merging with the buddy page case.
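
A toy sketch of that merge path (pfn arithmetic only, not the kernel's
__free_one_page()): when a block is freed and its buddy turns out to be
free too, the buddy has to be pulled off whatever free list it is sitting
on, which is why the linkage is a doubly linked list_head.

#include <stdbool.h>
#include <stdio.h>

/* Toy buddy merge over page frame numbers - illustrative only. */
#define MAX_ORDER 4
#define NPAGES (1u << MAX_ORDER)

/* "is pfn the start of a free block of this order?" */
static bool block_free[MAX_ORDER][NPAGES];

static void toy_free(unsigned int pfn, unsigned int order)
{
	while (order < MAX_ORDER - 1) {
		unsigned int buddy = pfn ^ (1u << order);

		if (!block_free[order][buddy])
			break;
		/*
		 * The buddy may sit anywhere on its free list; in the
		 * kernel this is a list_del() of its page->lru, which is
		 * why the linkage must be doubly linked.
		 */
		block_free[order][buddy] = false;
		pfn &= ~(1u << order);	/* merged block starts at the lower pfn */
		order++;
	}
	block_free[order][pfn] = true;
	printf("free block of order %u at pfn %u\n", order, pfn);
}

int main(void)
{
	toy_free(2, 1);	/* pfns 2-3 become free */
	toy_free(0, 1);	/* pfns 0-1: merge with their buddy into order 2 */
	return 0;
}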

          Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:16         ` David Hildenbrand
@ 2021-09-27 18:53           ` Vlastimil Babka
  2021-09-27 19:04               ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2021-09-27 18:53 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox, Kent Overstreet
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On 9/27/2021 8:16 PM, David Hildenbrand wrote:
> On 27.09.21 20:12, Matthew Wilcox wrote:
>>
>> But the thing we're allocating is at least PAGE_SIZE bytes in size.
>> Why is "We should avoid storing anything in the free page itself" true?
>>
> 
> Immediately comes to mind:
> * Free page reporting via virtio-balloon
> * CMA on s390x (see arch_free_page())
> * Free page poisoning
> * init_on_free

I was thinking of debug_pagealloc (unmaps free pages from direct map) but yeah,
the list is longer.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:53           ` Vlastimil Babka
@ 2021-09-27 19:04               ` Linus Torvalds
  0 siblings, 0 replies; 43+ messages in thread
From: Linus Torvalds @ 2021-09-27 19:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Matthew Wilcox, Kent Overstreet,
	linux-fsdevel, Linux Kernel Mailing List, Linux-MM,
	Johannes Weiner, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 11:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> I was thinking of debug_pagealloc (unmaps free pages from direct map) but yeah,
> the list is longer.

In fact, the _original_ free page list was in the page itself, not in
'struct page'.

The original reason to move it into 'struct page' ended up being
performance, iirc.

Because the free page list then always ended up in the same cache
line set, the page allocator caused horrendous cache patterns on
direct-mapped caches.

Direct-mapped caches may thankfully be gone, and we have a lot of
other things that end up having that property of "same offset within a
page" just because of allocation patterns (task struct allocations
being but one example), but it might still be something to try to
avoid.

               Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:09     ` Kent Overstreet
  2021-09-27 18:12       ` Matthew Wilcox
@ 2021-09-27 19:07       ` Vlastimil Babka
  2021-09-27 20:14         ` Kent Overstreet
  2021-09-28 11:21         ` David Laight
  1 sibling, 2 replies; 43+ messages in thread
From: Vlastimil Babka @ 2021-09-27 19:07 UTC (permalink / raw)
  To: Kent Overstreet, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On 9/27/2021 8:09 PM, Kent Overstreet wrote:
> On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
>> On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
>>> Won't be easy to cram all that into two unsigned long's, or even a single
>>> one. We should avoid storing anything in the free page itself. Allocating
>>> some external structures to track free pages is going to have funny
>>> bootstrap problems. Probably a major redesign would be needed...
>>
>> Wait, why do we want to avoid using the memory that we're allocating?
> 
> The issue is where to stick the state for free pages. If that doesn't fit in two
> ulongs, then we'd need a separate allocation, which means slab needs to be up
> and running before free pages are initialized.

So that's what I meant by the funny bootstrap problems - slab allocates pages
from the buddy allocator. And it's not just bootstrap: imagine free memory
becomes low and we need to reclaim pages, but in order to turn full pages back
into free buddy pages we need to allocate these slab structures, and the slab
is full too and needs to allocate more backing pages...

By "major redesign" I meant e.g. something along the lines of bitmaps of free
pages per order (instead of the free lists)? Hm, but I guess no - the worst-case
time searching for free pages would just suck...
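
A toy version of that bitmap idea, just to show where the worst case comes
from (illustrative only): allocation has to scan for a free flag, so the
cost grows with the number of pages, whereas popping the head of a free
list is O(1).

#include <stdbool.h>
#include <stdio.h>

#define NPAGES (1u << 20)

/* Toy "bitmap of free pages": one flag per page. */
static bool page_free[NPAGES];

static long find_free_page(void)
{
	unsigned long i;

	for (i = 0; i < NPAGES; i++)	/* O(NPAGES) worst case... */
		if (page_free[i])
			return (long)i;
	return -1;			/* ...vs O(1) to pop a free-list head */
}

int main(void)
{
	page_free[NPAGES - 1] = true;	/* only the very last page is free */
	printf("found free page %ld after scanning almost everything\n",
	       find_free_page());
	return 0;
}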

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 19:07       ` Vlastimil Babka
@ 2021-09-27 20:14         ` Kent Overstreet
  2021-09-28 11:21         ` David Laight
  1 sibling, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2021-09-27 20:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 09:07:21PM +0200, Vlastimil Babka wrote:
> On 9/27/2021 8:09 PM, Kent Overstreet wrote:
> > On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> >> On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> >>> Won't be easy to cram all that into two unsigned long's, or even a single
> >>> one. We should avoid storing anything in the free page itself. Allocating
> >>> some external structures to track free pages is going to have funny
> >>> bootstrap problems. Probably a major redesign would be needed...
> >>
> >> Wait, why do we want to avoid using the memory that we're allocating?
> > 
> > The issue is where to stick the state for free pages. If that doesn't fit in two
> > ulongs, then we'd need a separate allocation, which means slab needs to be up
> > and running before free pages are initialized.
> 
> So that's what I meant by the funny bootstrap problems - slab allocates pages
> from the buddy allocator. And it's not just bootstrap: imagine free memory
> becomes low and we need to reclaim pages, but in order to turn full pages back
> into free buddy pages we need to allocate these slab structures, and the slab
> is full too and needs to allocate more backing pages...
> 
> By "major redesign" I meant e.g. something along the lines of bitmaps of free
> pages per order (instead of the free lists)? Hm, but I guess no - the worst-case
> time searching for free pages would just suck...

We can have arrays of pointers to free pages - then within struct page we'd keep
the index of that page in the freelist array. To take a page off the middle of
the freelist we'd just swap it with the one at the end. And instead of using a
literal array, we'd want to use a simple radix tree.
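
A toy sketch of that scheme (names are made up, and a flat array stands in
for the radix tree): each free page remembers its slot in the freelist, so
removing an arbitrary page is just a swap with the last entry.

#include <stdio.h>

#define NPAGES 8

static unsigned int freelist[NPAGES];		/* page numbers of free pages */
static unsigned int nr_free;
static unsigned int index_in_freelist[NPAGES];	/* per-page: where am I? */

static void free_page(unsigned int pfn)
{
	index_in_freelist[pfn] = nr_free;
	freelist[nr_free++] = pfn;
}

/* Remove an arbitrary page, e.g. because it just merged with its buddy. */
static void remove_free_page(unsigned int pfn)
{
	unsigned int idx = index_in_freelist[pfn];
	unsigned int last = freelist[--nr_free];

	freelist[idx] = last;		/* move the last entry into the hole */
	index_in_freelist[last] = idx;	/* and update its back-pointer */
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < 5; i++)
		free_page(i);
	remove_free_page(2);		/* O(1), no doubly linked list needed */

	for (i = 0; i < nr_free; i++)
		printf("free: pfn %u\n", freelist[i]);
	return 0;
}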

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:34       ` Linus Torvalds
  (?)
@ 2021-09-27 20:45       ` David Hildenbrand
  -1 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2021-09-27 20:45 UTC (permalink / raw)
  To: Linus Torvalds, Kent Overstreet
  Cc: Vlastimil Babka, linux-fsdevel, Linux Kernel Mailing List,
	Linux-MM, Johannes Weiner, Matthew Wilcox, Andrew Morton,
	Darrick J. Wong, Christoph Hellwig, David Howells

On 27.09.21 20:34, Linus Torvalds wrote:
> On Mon, Sep 27, 2021 at 10:54 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
>>
>> That list_head is the problematic one. Why do we need to be able to take a page
>> from the middle of a freelist?
> 
> At least for the merging with the buddy page case.

And alloc_contig_range(), page reporting, page poisoning ... and
probably more.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Struct page proposal
  2021-09-27 18:16         ` Kent Overstreet
@ 2021-09-28  3:19           ` Matthew Wilcox
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2021-09-28  3:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, linux-fsdevel, linux-kernel, linux-mm,
	Johannes Weiner, Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

On Mon, Sep 27, 2021 at 02:16:53PM -0400, Kent Overstreet wrote:
> On Mon, Sep 27, 2021 at 07:12:19PM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 27, 2021 at 02:09:49PM -0400, Kent Overstreet wrote:
> > > On Mon, Sep 27, 2021 at 07:05:26PM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 27, 2021 at 07:48:15PM +0200, Vlastimil Babka wrote:
> > > > > On 9/23/21 03:21, Kent Overstreet wrote:
> > > > > > So if we have this:
> > > > > > 
> > > > > > struct page {
> > > > > > 	unsigned long	allocator;
> > > > > > 	unsigned long	allocatee;
> > > > > > };
> > > > > > 
> > > > > > The allocator field would be used for either a pointer to slab/slub's state, if
> > > > > > it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> > > > > > allocation - like compound order today, and probably whether or not the
> > > > > > (compound group of) pages is free.
> > > > > 
> > > > > The "free page in buddy allocator" case will be interesting to implement.
> > > > > What the buddy allocator uses today is:
> > > > > 
> > > > > - PageBuddy - determine if page is free; a page_type (part of mapcount
> > > > > field) today, could be a bit in "allocator" field that would have to be 0 in
> > > > > all other "page is allocated" contexts.
> > > > > - nid/zid - to prevent merging across node/zone boundaries, now part of
> > > > > page flags
> > > > > - buddy order
> > > > > - a list_head (reusing the "lru") to hold the struct page on the appropriate
> > > > > free list, which has to be double-linked so a page can be taken from the
> > > > > middle of the list instantly
> > > > > 
> > > > > Won't be easy to cram all that into two unsigned long's, or even a single
> > > > > one. We should avoid storing anything in the free page itself. Allocating
> > > > > some external structures to track free pages is going to have funny
> > > > > bootstrap problems. Probably a major redesign would be needed...
> > > > 
> > > > Wait, why do we want to avoid using the memory that we're allocating?
> > > 
> > > The issue is where to stick the state for free pages. If that doesn't fit in two
> > > ulongs, then we'd need a separate allocation, which means slab needs to be up
> > > and running before free pages are initialized.
> > 
> > But the thing we're allocating is at least PAGE_SIZE bytes in size.
> > Why is "We should avoid storing anything in the free page itself" true?
> 
> Good point!
> 
> Highmem and dax do complicate things though - would they make it too much of a
> hassle? You want to get rid of struct page for dax (what's the right term for
> that kind of memory?), but we're not there yet, right?

DAX is used for persistent memory, often abbreviated to pmem.

"Getting there" involves rooting out struct page from all kinds of data
structures.  sg lists are the obvious place to start so that we can do
I/O to memory that's not backed by a struct page.  Get that working and
the rent-a-VM companies will love you.  Right now, we either pay the 1.6%
tax twice (once for the struct pages in the host, and once in the guest),
or we have horrendous hacks to create struct pages on the fly so the
host can do I/O to the guest's memory.
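
(Back of the envelope: assuming the usual 64-byte struct page per 4 KiB page,
that's 64/4096 ~= 1.6%, and since the guest keeps its own struct pages for
the same memory, you pay it a second time there.)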

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: Struct page proposal
  2021-09-27 19:07       ` Vlastimil Babka
  2021-09-27 20:14         ` Kent Overstreet
@ 2021-09-28 11:21         ` David Laight
  1 sibling, 0 replies; 43+ messages in thread
From: David Laight @ 2021-09-28 11:21 UTC (permalink / raw)
  To: 'Vlastimil Babka', Kent Overstreet, Matthew Wilcox
  Cc: linux-fsdevel, linux-kernel, linux-mm, Johannes Weiner,
	Linus Torvalds, Andrew Morton, Darrick J. Wong,
	Christoph Hellwig, David Howells

...
> By "major redesign" I meant e.g. something along - bitmaps of free pages per
> each order? (instead of the free lists) Hm but I guess no, the worst case times
> searching for free pages would just suck...

Arrays of pointers are more cache-friendly than linked lists.
But you may need the 'array of pointers' to actually be a
linked list of array chunks!
And while you might need to extend that structure on a 'free',
if it is tracking pages you've always got a spare page to hand -
the one being freed.
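
Something along these lines, perhaps (a sketch, all names invented; each
chunk lives inside one of the free pages itself, so growing the structure
on a free never needs a separate allocation - which obviously doesn't work
for highmem):

#define PTRS_PER_CHUNK \
	((PAGE_SIZE - 2 * sizeof(unsigned long)) / sizeof(struct page *))

struct free_chunk {
	struct free_chunk	*next;
	unsigned long		nr;
	struct page		*pages[PTRS_PER_CHUNK];
};

static void chunk_free_page(struct free_chunk **head, struct page *page)
{
	struct free_chunk *c = *head;

	if (!c || c->nr == PTRS_PER_CHUNK) {
		/* Current chunk is full: the page being freed becomes the
		 * storage for the next chunk. When a chunk later empties,
		 * its backing page is simply the next one handed out. */
		c = page_address(page);
		c->next = *head;
		c->nr = 0;
		*head = c;
		return;
	}
	c->pages[c->nr++] = page;
}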

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2021-09-28 11:21 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-23  1:21 Struct page proposal Kent Overstreet
2021-09-23  3:23 ` Matthew Wilcox
2021-09-23  5:15   ` Kent Overstreet
2021-09-23 11:40     ` Mapcount of subpages Matthew Wilcox
2021-09-23 12:45       ` Kirill A. Shutemov
2021-09-23 21:10         ` Hugh Dickins
2021-09-23 21:10           ` Hugh Dickins
2021-09-23 21:54           ` Yang Shi
2021-09-23 21:54             ` Yang Shi
2021-09-23 22:23             ` Zi Yan
2021-09-23 23:48               ` Hugh Dickins
2021-09-23 23:48                 ` Hugh Dickins
2021-09-24  0:25                 ` Zi Yan
2021-09-24  0:57                   ` Hugh Dickins
2021-09-24  0:57                     ` Hugh Dickins
2021-09-24  1:11                 ` Yang Shi
2021-09-24  1:11                   ` Yang Shi
2021-09-24  1:31                   ` Matthew Wilcox
2021-09-24  3:26                     ` Yang Shi
2021-09-24  3:26                       ` Yang Shi
2021-09-24 23:05           ` Kirill A. Shutemov
2021-09-23 18:56       ` Mike Kravetz
2021-09-23  9:03 ` Struct page proposal David Hildenbrand
2021-09-23 15:22   ` Kent Overstreet
2021-09-23 15:34     ` David Hildenbrand
2021-09-27 17:48 ` Vlastimil Babka
2021-09-27 17:53   ` Kent Overstreet
2021-09-27 18:34     ` Linus Torvalds
2021-09-27 18:34       ` Linus Torvalds
2021-09-27 20:45       ` David Hildenbrand
2021-09-27 18:05   ` Matthew Wilcox
2021-09-27 18:09     ` Kent Overstreet
2021-09-27 18:12       ` Matthew Wilcox
2021-09-27 18:16         ` David Hildenbrand
2021-09-27 18:53           ` Vlastimil Babka
2021-09-27 19:04             ` Linus Torvalds
2021-09-27 19:04               ` Linus Torvalds
2021-09-27 18:16         ` Kent Overstreet
2021-09-28  3:19           ` Matthew Wilcox
2021-09-27 19:07       ` Vlastimil Babka
2021-09-27 20:14         ` Kent Overstreet
2021-09-28 11:21         ` David Laight
2021-09-27 18:33     ` Kirill A. Shutemov
