* [LSF/MM/BPF TOPIC] State Of The Page
@ 2023-01-26 16:40 Matthew Wilcox
  2023-02-21 16:57 ` David Howells
  2023-02-21 18:08 ` Gao Xiang
  0 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-01-26 16:40 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

I'd like to do another session on how the struct page dismemberment
is going and what remains to be done.  Given how widely struct page is
used, I think there will be interest from more than just MM, so I'd
suggest a plenary session.

If I were hosting this session today, topics would include:

Splitting out users:

 - slab (done!)
 - netmem (in progress)
 - hugetlb (in akpm)
 - tail pages (in akpm)
 - page tables
 - ZONE_DEVICE

Users that really should have their own types:

 - zsmalloc
 - bootmem
 - percpu
 - buddy
 - vmalloc

Converting filesystems to folios:

 - XFS (done)
 - AFS (done)
 - NFS (in progress)
 - ext4 (in progress)
 - f2fs (in progress)
 - ... others?

Unresolved challenges:

 - mapcount
 - AnonExclusive
 - Splitting anon & file folios apart
 - Removing PG_error & PG_private

This will probably all change before May.

I'd like to nominate Vishal Moola & Sidhartha Kumar as invitees based on
their work to convert various functions from pages to folios.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-01-26 16:40 [LSF/MM/BPF TOPIC] State Of The Page Matthew Wilcox
@ 2023-02-21 16:57 ` David Howells
  2023-02-21 18:08 ` Gao Xiang
  1 sibling, 0 replies; 26+ messages in thread
From: David Howells @ 2023-02-21 16:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

Matthew Wilcox <willy@infradead.org> wrote:

> I'd like to do another session on how the struct page dismemberment
> is going and what remains to be done.  Given how widely struct page is
> used, I think there will be interest from more than just MM, so I'd
> suggest a plenary session.

I'd certainly be interested in that.  I've recently been looking into
improving folio support in various places, including now splice and pipes.

David



* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-01-26 16:40 [LSF/MM/BPF TOPIC] State Of The Page Matthew Wilcox
  2023-02-21 16:57 ` David Howells
@ 2023-02-21 18:08 ` Gao Xiang
  2023-02-21 19:09   ` Yang Shi
  2023-02-21 19:58   ` Matthew Wilcox
  1 sibling, 2 replies; 26+ messages in thread
From: Gao Xiang @ 2023-02-21 18:08 UTC (permalink / raw)
  To: Matthew Wilcox, lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf



On 2023/1/27 00:40, Matthew Wilcox wrote:
> I'd like to do another session on how the struct page dismemberment
> is going and what remains to be done.  Given how widely struct page is
> used, I think there will be interest from more than just MM, so I'd
> suggest a plenary session.
> 
> If I were hosting this session today, topics would include:
> 
> Splitting out users:
> 
>   - slab (done!)
>   - netmem (in progress)
>   - hugetlb (in akpm)
>   - tail pages (in akpm)
>   - page tables
>   - ZONE_DEVICE
> 
> Users that really should have their own types:
> 
>   - zsmalloc
>   - bootmem
>   - percpu
>   - buddy
>   - vmalloc
> 
> Converting filesystems to folios:
> 
>   - XFS (done)
>   - AFS (done)
>   - NFS (in progress)
>   - ext4 (in progress)
>   - f2fs (in progress)
>   - ... others?
> 
> Unresolved challenges:
> 
>   - mapcount
>   - AnonExclusive
>   - Splitting anon & file folios apart
>   - Removing PG_error & PG_private

I'm interested in this topic too.  I'd also like to get some idea of
the page dismemberment timeline so that I can keep pace with it, since
some embedded use cases like Android are memory-sensitive all the time.

A minor point: it seems some APIs still use the ->lru field to chain
bulk pages, so perhaps they need some changes as well:
https://lore.kernel.org/r/20221222124412.rpnl2vojnx7izoow@techsingularity.net
https://lore.kernel.org/r/20230214190221.1156876-2-shy828301@gmail.com

Thanks,
Gao Xiang

> 
> This will probably all change before May.
> 
> I'd like to nominate Vishal Moola & Sidhartha Kumar as invitees based on
> their work to convert various functions from pages to folios.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 18:08 ` Gao Xiang
@ 2023-02-21 19:09   ` Yang Shi
  2023-02-22  2:40     ` Gao Xiang
  2023-02-21 19:58   ` Matthew Wilcox
  1 sibling, 1 reply; 26+ messages in thread
From: Yang Shi @ 2023-02-21 19:09 UTC (permalink / raw)
  To: Gao Xiang, Mel Gorman
  Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

On Tue, Feb 21, 2023 at 10:08 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
>
>
> On 2023/1/27 00:40, Matthew Wilcox wrote:
> > I'd like to do another session on how the struct page dismemberment
> > is going and what remains to be done.  Given how widely struct page is
> > used, I think there will be interest from more than just MM, so I'd
> > suggest a plenary session.
> >
> > If I were hosting this session today, topics would include:
> >
> > Splitting out users:
> >
> >   - slab (done!)
> >   - netmem (in progress)
> >   - hugetlb (in akpm)
> >   - tail pages (in akpm)
> >   - page tables
> >   - ZONE_DEVICE
> >
> > Users that really should have their own types:
> >
> >   - zsmalloc
> >   - bootmem
> >   - percpu
> >   - buddy
> >   - vmalloc
> >
> > Converting filesystems to folios:
> >
> >   - XFS (done)
> >   - AFS (done)
> >   - NFS (in progress)
> >   - ext4 (in progress)
> >   - f2fs (in progress)
> >   - ... others?
> >
> > Unresolved challenges:
> >
> >   - mapcount
> >   - AnonExclusive
> >   - Splitting anon & file folios apart
> >   - Removing PG_error & PG_private
>
> I'm interested in this topic too, also I'd like to get some idea of the
> future of the page dismemberment timeline so that I can have time to keep
> the pace with it since some embedded use cases like Android are
> memory-sensitive all the time.
>
> Minor, it seems some apis still use ->lru field to chain bulk pages,
> perhaps it needs some changes as well:
> https://lore.kernel.org/r/20221222124412.rpnl2vojnx7izoow@techsingularity.net
> https://lore.kernel.org/r/20230214190221.1156876-2-shy828301@gmail.com

The dm-crypt patches don't use the list anymore. The bulk allocator
still supports the list version, but so far there are no users, so it
may be gone soon.

>
> Thanks,
> Gao Xiang
>
> >
> > This will probably all change before May.
> >
> > I'd like to nominate Vishal Moola & Sidhartha Kumar as invitees based on
> > their work to convert various functions from pages to folios.
>


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 18:08 ` Gao Xiang
  2023-02-21 19:09   ` Yang Shi
@ 2023-02-21 19:58   ` Matthew Wilcox
  2023-02-22  2:38     ` Gao Xiang
                       ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Matthew Wilcox @ 2023-02-21 19:58 UTC (permalink / raw)
  To: Gao Xiang
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Wed, Feb 22, 2023 at 02:08:28AM +0800, Gao Xiang wrote:
> On 2023/1/27 00:40, Matthew Wilcox wrote:
> > I'd like to do another session on how the struct page dismemberment
> > is going and what remains to be done.  Given how widely struct page is
> > used, I think there will be interest from more than just MM, so I'd
> > suggest a plenary session.
> 
> I'm interested in this topic too, also I'd like to get some idea of the
> future of the page dismemberment timeline so that I can have time to keep
> the pace with it since some embedded use cases like Android are
> memory-sensitive all the time.

As you all know, I'm absolutely amazing at project management & planning
and can tell you to the day when a feature will be ready ;-)

My goal for 2023 is to get to a point where we (a) have struct page
reduced to:

struct page {
	unsigned long flags;
	struct list_head lru;
	struct address_space *mapping;
	pgoff_t index;
	unsigned long private;
	atomic_t _mapcount;
	atomic_t _refcount;
	unsigned long memcg_data;
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

and (b) can build an allnoconfig kernel with:

struct page {
	unsigned long flags;
	unsigned long padding[5];
	atomic_t _mapcount;
	atomic_t _refcount;
	unsigned long padding2;
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
};

> Minor, it seems some apis still use ->lru field to chain bulk pages,
> perhaps it needs some changes as well:
> https://lore.kernel.org/r/20221222124412.rpnl2vojnx7izoow@techsingularity.net
> https://lore.kernel.org/r/20230214190221.1156876-2-shy828301@gmail.com

Yang Shi covered the actual (non-)use of the list version of the bulk
allocator already, but perhaps more importantly, each page allocated
by the bulk allocator is actually a separately tracked allocation.
So the obvious translation of the bulk allocator from pages to folios
is that it allocates N order-0 folios.
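
In other words, something with roughly this shape (purely illustrative;
folio_alloc_bulk() is a made-up name, not a proposed API):

/*
 * Allocate up to @nr order-0 folios into @folios, returning how many
 * we managed to allocate.  Each folio is an independently tracked,
 * independently freed allocation.
 */
static unsigned long folio_alloc_bulk(gfp_t gfp, unsigned long nr,
				      struct folio **folios)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		folios[i] = folio_alloc(gfp, 0);
		if (!folios[i])
			break;
	}
	return i;
}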

That may not be the best approach for all the users of the bulk allocator,
so we may end up doing something different.  At any rate, use of page->lru
isn't the problem here (yes, it's something that would need to change,
but it's not a big conceptual problem).


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 19:58   ` Matthew Wilcox
@ 2023-02-22  2:38     ` Gao Xiang
  2023-03-02  3:17     ` David Rientjes
  2023-03-02  3:50     ` Pasha Tatashin
  2 siblings, 0 replies; 26+ messages in thread
From: Gao Xiang @ 2023-02-22  2:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf



On 2023/2/22 03:58, Matthew Wilcox wrote:
> On Wed, Feb 22, 2023 at 02:08:28AM +0800, Gao Xiang wrote:
>> On 2023/1/27 00:40, Matthew Wilcox wrote:
>>> I'd like to do another session on how the struct page dismemberment
>>> is going and what remains to be done.  Given how widely struct page is
>>> used, I think there will be interest from more than just MM, so I'd
>>> suggest a plenary session.
>>
>> I'm interested in this topic too, also I'd like to get some idea of the
>> future of the page dismemberment timeline so that I can have time to keep
>> the pace with it since some embedded use cases like Android are
>> memory-sensitive all the time.
> 
> As you all know, I'm absolutely amazing at project management & planning
> and can tell you to the day when a feature will be ready ;-)

Yeah, but this core work actually impacts various subsystems, so it
would be better to get some idea in advance; otherwise I'm not sure I
will have spare cycles to handle these changes.

> 
> My goal for 2023 is to get to a point where we (a) have struct page
> reduced to:
> 
> struct page {
> 	unsigned long flags;
> 	struct list_head lru;
> 	struct address_space *mapping;
> 	pgoff_t index;
> 	unsigned long private;
> 	atomic_t _mapcount;
> 	atomic_t _refcount;
> 	unsigned long memcg_data;
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> 	int _last_cpupid;
> #endif
> };
> 
> and (b) can build an allnoconfig kernel with:
> 
> struct page {
> 	unsigned long flags;
> 	unsigned long padding[5];
> 	atomic_t _mapcount;
> 	atomic_t _refcount;
> 	unsigned long padding2;
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> 	int _last_cpupid;
> #endif
> };

Okay, with the plan above, how will it work with memdescs in the long
term?

Also, in the future I'd at least like to know whether (and how) it's
possible to get the folio itself from a page, and how to tell whether a
folio has been truncated or is still connected to one (or more) inodes.

Anyway, all of the above is interesting to me, and it could help avoid
some extra, useless folio adoption in the wrong direction.  It would
also help me form better thoughts on how to make page cache sharing
work.

I can imagine many of these are still in a preliminary form for now,
but some detailed plans would be very helpful.

> 
>> Minor, it seems some apis still use ->lru field to chain bulk pages,
>> perhaps it needs some changes as well:
>> https://lore.kernel.org/r/20221222124412.rpnl2vojnx7izoow@techsingularity.net
>> https://lore.kernel.org/r/20230214190221.1156876-2-shy828301@gmail.com
> 
> Yang Shi covered the actual (non-)use of the list version of the bulk
> allocator already, but perhaps more importantly, each page allocated
> by the bulk allocator is actually a separately tracked allocation.
> So the obvious translation of the bulk allocator from pages to folios
> is that it allocates N order-0 folios.
> 
> That may not be the best approach for all the users of the bulk allocator,
> so we may end up doing something different.  At any rate, use of page->lru
> isn't the problem here (yes, it's something that would need to change,
> but it's not a big conceptual problem).

Yes, I'd just like to confirm how to use such APIs in the long term.
There's no rush for me at the moment, but I'd rather avoid using them
while the direction is still vague.

Thanks,
Gao Xiang



* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 19:09   ` Yang Shi
@ 2023-02-22  2:40     ` Gao Xiang
  0 siblings, 0 replies; 26+ messages in thread
From: Gao Xiang @ 2023-02-22  2:40 UTC (permalink / raw)
  To: Yang Shi, Mel Gorman
  Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf



On 2023/2/22 03:09, Yang Shi wrote:
> On Tue, Feb 21, 2023 at 10:08 AM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2023/1/27 00:40, Matthew Wilcox wrote:
>>> I'd like to do another session on how the struct page dismemberment
>>> is going and what remains to be done.  Given how widely struct page is
>>> used, I think there will be interest from more than just MM, so I'd
>>> suggest a plenary session.
>>>
>>> If I were hosting this session today, topics would include:
>>>
>>> Splitting out users:
>>>
>>>    - slab (done!)
>>>    - netmem (in progress)
>>>    - hugetlb (in akpm)
>>>    - tail pages (in akpm)
>>>    - page tables
>>>    - ZONE_DEVICE
>>>
>>> Users that really should have their own types:
>>>
>>>    - zsmalloc
>>>    - bootmem
>>>    - percpu
>>>    - buddy
>>>    - vmalloc
>>>
>>> Converting filesystems to folios:
>>>
>>>    - XFS (done)
>>>    - AFS (done)
>>>    - NFS (in progress)
>>>    - ext4 (in progress)
>>>    - f2fs (in progress)
>>>    - ... others?
>>>
>>> Unresolved challenges:
>>>
>>>    - mapcount
>>>    - AnonExclusive
>>>    - Splitting anon & file folios apart
>>>    - Removing PG_error & PG_private
>>
>> I'm interested in this topic too, also I'd like to get some idea of the
>> future of the page dismemberment timeline so that I can have time to keep
>> the pace with it since some embedded use cases like Android are
>> memory-sensitive all the time.
>>
>> Minor, it seems some apis still use ->lru field to chain bulk pages,
>> perhaps it needs some changes as well:
>> https://lore.kernel.org/r/20221222124412.rpnl2vojnx7izoow@techsingularity.net
>> https://lore.kernel.org/r/20230214190221.1156876-2-shy828301@gmail.com
> 
> The dm-crypt patches don't use list anymore. The bulk allocator still
> supports the list version, but so far there is no user, so it may be
> gone soon.

Thanks, it's just a minor detail relating to page->lru.  I'm in no rush
to evaluate/use it at the moment.

> 
>>
>> Thanks,
>> Gao Xiang
>>
>>>
>>> This will probably all change before May.
>>>
>>> I'd like to nominate Vishal Moola & Sidhartha Kumar as invitees based on
>>> their work to convert various functions from pages to folios.
>>


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 19:58   ` Matthew Wilcox
  2023-02-22  2:38     ` Gao Xiang
@ 2023-03-02  3:17     ` David Rientjes
  2023-03-02  3:50     ` Pasha Tatashin
  2 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2023-03-02  3:17 UTC (permalink / raw)
  To: Matthew Wilcox, Pasha Tatashin
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

On Tue, 21 Feb 2023, Matthew Wilcox wrote:

> On Wed, Feb 22, 2023 at 02:08:28AM +0800, Gao Xiang wrote:
> > On 2023/1/27 00:40, Matthew Wilcox wrote:
> > > I'd like to do another session on how the struct page dismemberment
> > > is going and what remains to be done.  Given how widely struct page is
> > > used, I think there will be interest from more than just MM, so I'd
> > > suggest a plenary session.
> > 
> > I'm interested in this topic too, also I'd like to get some idea of the
> > future of the page dismemberment timeline so that I can have time to keep
> > the pace with it since some embedded use cases like Android are
> > memory-sensitive all the time.
> 
> As you all know, I'm absolutely amazing at project management & planning
> and can tell you to the day when a feature will be ready ;-)
> 
> My goal for 2023 is to get to a point where we (a) have struct page
> reduced to:
> 
> struct page {
> 	unsigned long flags;
> 	struct list_head lru;
> 	struct address_space *mapping;
> 	pgoff_t index;
> 	unsigned long private;
> 	atomic_t _mapcount;
> 	atomic_t _refcount;
> 	unsigned long memcg_data;
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> 	int _last_cpupid;
> #endif
> };
> 
> and (b) can build an allnoconfig kernel with:
> 
> struct page {
> 	unsigned long flags;
> 	unsigned long padding[5];
> 	atomic_t _mapcount;
> 	atomic_t _refcount;
> 	unsigned long padding2;
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> 	int _last_cpupid;
> #endif
> };
> 

This is exciting to see and I'd definitely like to participate in the 
discussion.  Reducing struct page overhead is an important investment area 
for large hyperscalers from an efficiency standpoint; we strand a massive 
amount of memory due to struct page today.  I'd be particularly interested 
in a division-of-work discussion so that we can help to bridge any gaps 
that exist in realizing Matthew's vision, both short term and long term.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-02-21 19:58   ` Matthew Wilcox
  2023-02-22  2:38     ` Gao Xiang
  2023-03-02  3:17     ` David Rientjes
@ 2023-03-02  3:50     ` Pasha Tatashin
  2023-03-02  4:03       ` Matthew Wilcox
  2 siblings, 1 reply; 26+ messages in thread
From: Pasha Tatashin @ 2023-03-02  3:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

On Tue, Feb 21, 2023 at 2:58 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Feb 22, 2023 at 02:08:28AM +0800, Gao Xiang wrote:
> > On 2023/1/27 00:40, Matthew Wilcox wrote:
> > > I'd like to do another session on how the struct page dismemberment
> > > is going and what remains to be done.  Given how widely struct page is
> > > used, I think there will be interest from more than just MM, so I'd
> > > suggest a plenary session.
> >
> > I'm interested in this topic too, also I'd like to get some idea of the
> > future of the page dismemberment timeline so that I can have time to keep
> > the pace with it since some embedded use cases like Android are
> > memory-sensitive all the time.
>
> As you all know, I'm absolutely amazing at project management & planning
> and can tell you to the day when a feature will be ready ;-)
>
> My goal for 2023 is to get to a point where we (a) have struct page
> reduced to:
>
> struct page {
>         unsigned long flags;
>         struct list_head lru;
>         struct address_space *mapping;
>         pgoff_t index;
>         unsigned long private;
>         atomic_t _mapcount;
>         atomic_t _refcount;
>         unsigned long memcg_data;
> #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>         int _last_cpupid;
> #endif
> };

This looks clean, but it is still 64 bytes.  I wonder if we could
potentially reduce it to 56 bytes by removing memcg_data.
Something like this might work:

1. On a 64-bit system the flags field contains 19 unused bits; we could
potentially use those free bits.
2. There are up to 64K memcg IDs, so where this field holds a memcg
pointer, a 16-bit ID would be enough to convert back to the pointer.
3. Where memcg_data holds a pointer to a list of memcgs, there could be
a separate hash table that maps the page to its memcgs, for slabs or
other users.

I am not sure how that would affect performance, but it would be very
nice to shave 8 bytes off "struct page".
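
Roughly what I mean by (1) and (2), as a sketch only: the bit position
and helper name below are invented, a real implementation would need
the atomic page flag helpers, and mem_cgroup_from_id() needs RCU
protection here just as it does today.

#define PG_MEMCG_ID_SHIFT	44	/* pick 16 of the ~19 unused flag bits */
#define PG_MEMCG_ID_MASK	(0xffffUL << PG_MEMCG_ID_SHIFT)

static inline struct mem_cgroup *page_memcg_from_flags(const struct page *page)
{
	unsigned short id = (page->flags & PG_MEMCG_ID_MASK) >> PG_MEMCG_ID_SHIFT;

	/* ID 0 is reserved here to mean "no memcg" */
	return id ? mem_cgroup_from_id(id) : NULL;
}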

Pasha


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-03-02  3:50     ` Pasha Tatashin
@ 2023-03-02  4:03       ` Matthew Wilcox
  2023-03-02  4:16         ` Pasha Tatashin
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2023-03-02  4:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

On Wed, Mar 01, 2023 at 10:50:24PM -0500, Pasha Tatashin wrote:
> On Tue, Feb 21, 2023 at 2:58 PM Matthew Wilcox <willy@infradead.org> wrote:
> > My goal for 2023 is to get to a point where we (a) have struct page
> > reduced to:
> >
> > struct page {
> >         unsigned long flags;
> >         struct list_head lru;
> >         struct address_space *mapping;
> >         pgoff_t index;
> >         unsigned long private;
> >         atomic_t _mapcount;
> >         atomic_t _refcount;
> >         unsigned long memcg_data;
> > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >         int _last_cpupid;
> > #endif
> > };
> 
> This looks clean, but it is still 64-bytes. I wonder if we could
> potentially reduce it down to 56 bytes by removing memcg_data.

We need struct page to be 16-byte aligned to make slab work.  We also need
it to divide PAGE_SIZE evenly to make CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
work.  I don't think it's worth nibbling around the edges like this
anyway; convert everything from page to folio and then we can do the
big bang conversion where struct page shrinks from 64 bytes to 8.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2023-03-02  4:03       ` Matthew Wilcox
@ 2023-03-02  4:16         ` Pasha Tatashin
  0 siblings, 0 replies; 26+ messages in thread
From: Pasha Tatashin @ 2023-03-02  4:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gao Xiang, lsf-pc, linux-fsdevel, linux-mm, linux-block,
	linux-ide, linux-scsi, linux-nvme, bpf

On Wed, Mar 1, 2023 at 11:03 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Mar 01, 2023 at 10:50:24PM -0500, Pasha Tatashin wrote:
> > On Tue, Feb 21, 2023 at 2:58 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > My goal for 2023 is to get to a point where we (a) have struct page
> > > reduced to:
> > >
> > > struct page {
> > >         unsigned long flags;
> > >         struct list_head lru;
> > >         struct address_space *mapping;
> > >         pgoff_t index;
> > >         unsigned long private;
> > >         atomic_t _mapcount;
> > >         atomic_t _refcount;
> > >         unsigned long memcg_data;
> > > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> > >         int _last_cpupid;
> > > #endif
> > > };
> >
> > This looks clean, but it is still 64-bytes. I wonder if we could
> > potentially reduce it down to 56 bytes by removing memcg_data.
>
> We need struct page to be 16-byte aligned to make slab work.  We also need
> it to divide PAGE_SIZE evenly to make CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP

Hm, can you please elaborate on both of these cases?  How do they work
today with _last_cpupid configs, or other configs that increase
"struct page" above 64 bytes?

> work.  I don't think it's worth nibbling around the edges like this
> anyway; convert everything from page to folio and then we can do the
> big bang conversion where struct page shrinks from 64 bytes to 8.

I agree with the general idea that converting to folios and shrinking
"struct page" to 8 bytes can be a big memory consumption win, but even
then we do not want to encourage memdesc users to use larger-than-needed
types.  If "flags" and "memcg_data" are going to be part of almost every
single memdesc type, it would be nice to reduce them from 16 bytes to
8 bytes.

Pasha


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-27 17:57 ` Kent Overstreet
@ 2024-01-27 18:43   ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-27 18:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Sat, Jan 27, 2024 at 12:57:45PM -0500, Kent Overstreet wrote:
> On Fri, Jan 19, 2024 at 04:24:29PM +0000, Matthew Wilcox wrote:
> >  - What are we going to do about bio_vecs?
> 
> For bios and biovecs, I think it's important to keep in mind the
> distinction between the code that owns and submits the bio, and the
> consumer underneath.
> 
> The code underneath could just as easily work with pfns, and the code
> above got those pages from somewhere else, so it doesn't _need_ the bio
> for access to those pages/folios (it would be a lot of refactoring
> though).
> 
> But I've been thinking about going in a different direction - what if we
> unified iov_iter and bio? We've got ~3 different scatter-gather types
> that an IO passes through down the stack, and it would be lovely if we
> could get it down to just one; e.g. for DIO, pinning pages right at the
> copy_from_user boundary.

Yes, but ...

One of the things that Xen can do and Linux can't is I/O to/from memory
that doesn't have an associated struct page.  We have all kinds of hacks
in place to get around that right now, and I'd like to remove those.

Since we want that kind of memory (lets take, eg, GPU memory as an
example) to be mappable to userspace, and we want to be able to do DIO
to that memory, that points us to using a non-page-based structure right
from the start.  Yes, if it happens to be backed by pages we need to 'pin'
them in some way (I'd like to get away from per-page or even per-folio
pinning, but we'll see about that), but the data structure that we use
to represent that memory as it moves through the I/O subsystem needs to
be physical address based.

So my 40,000 foot view is that we do something like get_user_phyrs()
at the start of DIO and pass the phyrs to the filesystem; the filesystem
then passes one or more phyrs to the block layer, and the block layer
gives the phyrs to the driver, which DMA-maps them.

Yes, the IO completion path (for buffered IO) needs to figure out which
folios are described by each phyr, but that's a phys_to_folio() call away.
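
The rough shape I have in mind is below; treat it as a strawman, since
the field names and the exact pinning entry point are still very much
up for debate:

struct phyr {
	phys_addr_t	addr;	/* physical start address */
	unsigned int	len;	/* length of this physically contiguous range */
};

/* Strawman counterpart to pin_user_pages(): pin the user address range
 * and describe it as an array of phyrs instead of an array of pages.
 */
int get_user_phyrs(unsigned long uaddr, size_t size, unsigned int gup_flags,
		   struct phyr *phyrs, unsigned int max_phyrs);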


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-19 16:24 Matthew Wilcox
                   ` (3 preceding siblings ...)
  2024-01-27 10:10 ` Amir Goldstein
@ 2024-01-27 17:57 ` Kent Overstreet
  2024-01-27 18:43   ` Matthew Wilcox
  4 siblings, 1 reply; 26+ messages in thread
From: Kent Overstreet @ 2024-01-27 17:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Fri, Jan 19, 2024 at 04:24:29PM +0000, Matthew Wilcox wrote:
>  - What are we going to do about bio_vecs?

For bios and biovecs, I think it's important to keep in mind the
distinction between the code that owns and submits the bio, and the
consumer underneath.

The code underneath could just as easily work with pfns, and the code
above got those pages from somewhere else, so it doesn't _need_ the bio
for access to those pages/folios (it would be a lot of refactoring
though).

But I've been thinking about going in a different direction - what if we
unified iov_iter and bio? We've got ~3 different scatter-gather types
that an IO passes through down the stack, and it would be lovely if we
could get it down to just one; e.g. for DIO, pinning pages right at the
copy_from_user boundary.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-27 10:10 ` Amir Goldstein
@ 2024-01-27 16:18   ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-27 16:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Sat, Jan 27, 2024 at 12:10:32PM +0200, Amir Goldstein wrote:
> Matthew,
> 
> And everyone else who suggests LSF/MM/BPF topic.
> 
> Please do not forget to also fill out the Google form:
> 
>           https://forms.gle/TGCgBDH1x5pXiWFo7
> 
> So we have your attendance request with suggested topics in our spreadsheet.

I'm pretty sure I already filled that out months ago within a day of the
initial announcement, and I thought I included SOTP as one of the topics.
But honestly, I'm not sure which topics I filled in there.  Is there a
way for me to know what I wrote in and edit my initial response?


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-19 16:24 Matthew Wilcox
                   ` (2 preceding siblings ...)
  2024-01-21 21:00 ` David Rientjes
@ 2024-01-27 10:10 ` Amir Goldstein
  2024-01-27 16:18   ` Matthew Wilcox
  2024-01-27 17:57 ` Kent Overstreet
  4 siblings, 1 reply; 26+ messages in thread
From: Amir Goldstein @ 2024-01-27 10:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Fri, Jan 19, 2024 at 6:24 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> It's probably worth doing another roundup of where we are on our journey
> to separating folios, slabs, pages, etc.  Something suitable for people
> who aren't MM experts, and don't care about the details of how page
> allocation works.  I can talk for hours about whatever people want to
> hear about but some ideas from me:
>
>  - Overview of how the conversion is going
>  - Convenience functions for filesystem writers
>  - What's next?
>  - What's the difference between &folio->page and page_folio(folio, 0)?
>  - What are we going to do about bio_vecs?
>  - How does all of this work with kmap()?
>
> I'm sure people would like to suggest other questions they have that
> aren't adequately answered already and might be of interest to a wider
> audience.
>

Matthew,

And everyone else who suggests LSF/MM/BPF topic.

Please do not forget to also fill out the Google form:

          https://forms.gle/TGCgBDH1x5pXiWFo7

So we have your attendance request with suggested topics in our spreadsheet.

Thanks,
Amir.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-24 17:55       ` Matthew Wilcox
@ 2024-01-24 19:05         ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-01-24 19:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

On Wed, 24 Jan 2024, Matthew Wilcox wrote:

> On Wed, Jan 24, 2024 at 09:51:02AM -0800, Christoph Lameter (Ampere) wrote:
>> Can we come up with a design that uses a huge page (or some arbitrary page
>> size) and the breaks out portions of the large page? That way potentially
>> TLB use can be reduced (multiple sections of a large page use the same TLB)
>> and defragmentation occurs because allocs and frees focus on a selection of
>> large memory sections.
>
> Could I trouble you to reply in this thread:
> https://lore.kernel.org/linux-mm/Za6RXtSE_TSdrRm_@casper.infradead.org/
> where I actually outline what I think we should do next.

Ahh... ok. Will do.



* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-24 17:51     ` Christoph Lameter (Ampere)
@ 2024-01-24 17:55       ` Matthew Wilcox
  2024-01-24 19:05         ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-24 17:55 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

On Wed, Jan 24, 2024 at 09:51:02AM -0800, Christoph Lameter (Ampere) wrote:
> Can we come up with a design that uses a huge page (or some arbitrary page
> size) and the breaks out portions of the large page? That way potentially
> TLB use can be reduced (multiple sections of a large page use the same TLB)
> and defragmentation occurs because allocs and frees focus on a selection of
> large memory sections.

Could I trouble you to reply in this thread:
https://lore.kernel.org/linux-mm/Za6RXtSE_TSdrRm_@casper.infradead.org/
where I actually outline what I think we should do next.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-21 23:14   ` Matthew Wilcox
  2024-01-21 23:31     ` Pasha Tatashin
@ 2024-01-24 17:51     ` Christoph Lameter (Ampere)
  2024-01-24 17:55       ` Matthew Wilcox
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-01-24 17:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

On Sun, 21 Jan 2024, Matthew Wilcox wrote:

>
> I'd like to keep this topic relevant to as many people as possible.
> I can add a proposal for a topic on both the PCP and Buddy allocators
> (I have a series of Thoughts on how the PCP allocator works in a memdesc
> world that I haven't written down & sent out yet).

Well, the PCP cache's (I would not call it an allocator) intent is to
provide cache-hot / TLB-hot pages.  In some ways this is like the
SLAB/SLUB situation, i.e. lists of objects vs. serving objects that are
locally related.

Can we come up with a design that uses a huge page (or some arbitrary
page size) and then breaks out portions of the large page?  That way
TLB pressure can potentially be reduced (multiple sections of a large
page share the same TLB entry) and defragmentation occurs because
allocs and frees focus on a selection of large memory sections.

This is roughly equivalent to a per-CPU page (folio?) in SLUB, where
cache-hot objects can be served from a single memory section and also
freed back without too much interaction with the higher-level, more
expensive components of the allocator.
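
A very rough sketch of what I mean, with all of the new names invented
and refill, freeing and preemption handling omitted: each CPU holds on
to one large region and hands out naturally aligned portions of it, so
consecutive allocations share a TLB entry and tend to be freed back
into the same region.

#define PCP_REGION_ORDER	9	/* e.g. 2MB with 4K pages; made up */

struct pcp_region {
	struct page	*base;	/* backing large allocation from the buddy */
	unsigned int	next;	/* index of the next unused base page */
};

static DEFINE_PER_CPU(struct pcp_region, pcp_region);

static struct page *pcp_region_alloc(void)
{
	struct pcp_region *r = this_cpu_ptr(&pcp_region);

	if (!r->base || r->next >= (1U << PCP_REGION_ORDER))
		return NULL;	/* caller falls back to the buddy allocator */

	return nth_page(r->base, r->next++);
}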


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-21 23:54       ` Matthew Wilcox
@ 2024-01-22  0:18         ` Pasha Tatashin
  0 siblings, 0 replies; 26+ messages in thread
From: Pasha Tatashin @ 2024-01-22  0:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

On Sun, Jan 21, 2024 at 6:54 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Jan 21, 2024 at 06:31:48PM -0500, Pasha Tatashin wrote:
> > On Sun, Jan 21, 2024 at 6:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > I can add a proposal for a topic on both the PCP and Buddy allocators
> > > (I have a series of Thoughts on how the PCP allocator works in a memdesc
> > > world that I haven't written down & sent out yet).
> >
> > Interesting, given that pcp are mostly allocated by kmalloc and use
> > vmalloc for large allocations, how memdesc can be different for them
> > compared to regular kmalloc allocations given that they are sub-page?
>
> Oh!  I don't mean the mm/percpu.c allocator.  I mean the pcp allocator
> in mm/page_alloc.c.

Nevermind, this makes perfect sense now :-)

> I don't have any Thoughts on mm/percpu.c at this time.  I'm vaguely
> aware that it exists ;-)
>
> > > Thee's so much work to be done!  And it's mostly parallelisable and almost
> > > trivial.  It's just largely on the filesystem-page cache interaction, so
> > > it's not terribly interesting.  See, for example, the ext2, ext4, gfs2,
> > > nilfs2, ufs and ubifs patchsets I've done over the past few releases.
> > > I have about half of an ntfs3 patchset ready to send.
> >
> > > There's a bunch of work to be done in DRM to switch from pages to folios
> > > due to their use of shmem.  You can also grep for 'page->mapping' (because
> > > fortunately we aren't too imaginative when it comes to naming variables)
> > > and find 270 places that need to be changed.  Some are comments, but
> > > those still need to be updated!
> > >
> > > Anything using lock_page(), get_page(), set_page_dirty(), using
> > > &folio->page, any of the functions in mm/folio-compat.c needs auditing.
> > > We can make the first three of those work, but they're good indicators
> > > that the code needs to be looked at.
> > >
> > > There is some interesting work to be done, and one of the things I'm
> > > thinking hard about right now is how we're doing folio conversions
> > > that make sense with today's code, and stop making sense when we get
> > > to memdescs.  That doesn't apply to anything interacting with the page
> > > cache (because those are folios now and in the future), but it does apply
> > > to one spot in ext4 where it allocates memory from slab and attaches a
> > > buffer_head to it ...
> >
> > There are many more drivers that would need the conversion. For
> > example, IOMMU page tables can occupy gigabytes of space, have
> > different implementations for AMD, X86, and several ARMs. Conversion
> > to memdesc and unifying the IO page table management implementation
> > for these platforms would be beneficial.
>
> Understood; there's a lot of code that can benefit from larger
> allocations.  I was listing the impediments to shrinking struct page
> rather than the places which would most benefit from switching to larger
> allocations.  They're complementary to a large extent; you can switch
> to compound allocations today and get the benefit later.  And unifying
> implementations is always a worthy project.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-21 23:31     ` Pasha Tatashin
@ 2024-01-21 23:54       ` Matthew Wilcox
  2024-01-22  0:18         ` Pasha Tatashin
  0 siblings, 1 reply; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-21 23:54 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

On Sun, Jan 21, 2024 at 06:31:48PM -0500, Pasha Tatashin wrote:
> On Sun, Jan 21, 2024 at 6:14 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I can add a proposal for a topic on both the PCP and Buddy allocators
> > (I have a series of Thoughts on how the PCP allocator works in a memdesc
> > world that I haven't written down & sent out yet).
> 
> Interesting, given that pcp are mostly allocated by kmalloc and use
> vmalloc for large allocations, how memdesc can be different for them
> compared to regular kmalloc allocations given that they are sub-page?

Oh!  I don't mean the mm/percpu.c allocator.  I mean the pcp allocator
in mm/page_alloc.c.

I don't have any Thoughts on mm/percpu.c at this time.  I'm vaguely
aware that it exists ;-)

> > Thee's so much work to be done!  And it's mostly parallelisable and almost
> > trivial.  It's just largely on the filesystem-page cache interaction, so
> > it's not terribly interesting.  See, for example, the ext2, ext4, gfs2,
> > nilfs2, ufs and ubifs patchsets I've done over the past few releases.
> > I have about half of an ntfs3 patchset ready to send.
> 
> > There's a bunch of work to be done in DRM to switch from pages to folios
> > due to their use of shmem.  You can also grep for 'page->mapping' (because
> > fortunately we aren't too imaginative when it comes to naming variables)
> > and find 270 places that need to be changed.  Some are comments, but
> > those still need to be updated!
> >
> > Anything using lock_page(), get_page(), set_page_dirty(), using
> > &folio->page, any of the functions in mm/folio-compat.c needs auditing.
> > We can make the first three of those work, but they're good indicators
> > that the code needs to be looked at.
> >
> > There is some interesting work to be done, and one of the things I'm
> > thinking hard about right now is how we're doing folio conversions
> > that make sense with today's code, and stop making sense when we get
> > to memdescs.  That doesn't apply to anything interacting with the page
> > cache (because those are folios now and in the future), but it does apply
> > to one spot in ext4 where it allocates memory from slab and attaches a
> > buffer_head to it ...
> 
> There are many more drivers that would need the conversion. For
> example, IOMMU page tables can occupy gigabytes of space, have
> different implementations for AMD, X86, and several ARMs. Conversion
> to memdesc and unifying the IO page table management implementation
> for these platforms would be beneficial.

Understood; there's a lot of code that can benefit from larger
allocations.  I was listing the impediments to shrinking struct page
rather than the places which would most benefit from switching to larger
allocations.  They're complementary to a large extent; you can switch
to compound allocations today and get the benefit later.  And unifying
implementations is always a worthy project.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-21 23:14   ` Matthew Wilcox
@ 2024-01-21 23:31     ` Pasha Tatashin
  2024-01-21 23:54       ` Matthew Wilcox
  2024-01-24 17:51     ` Christoph Lameter (Ampere)
  1 sibling, 1 reply; 26+ messages in thread
From: Pasha Tatashin @ 2024-01-21 23:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Rientjes, Pasha Tatashin, Sourav Panda, lsf-pc,
	linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

Hi Matthew,

Thank you for proposing this topic.  I would also like to be part of
this discussion at LSF/MM, specifically because of the memory
efficiency opportunities coming from memdescs.

On Sun, Jan 21, 2024 at 6:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Jan 21, 2024 at 01:00:40PM -0800, David Rientjes wrote:
> > On Fri, 19 Jan 2024, Matthew Wilcox wrote:
> > > It's probably worth doing another roundup of where we are on our journey
> > > to separating folios, slabs, pages, etc.  Something suitable for people
> > > who aren't MM experts, and don't care about the details of how page
> > > allocation works.  I can talk for hours about whatever people want to
> > > hear about but some ideas from me:
> > >
> > >  - Overview of how the conversion is going
> > >  - Convenience functions for filesystem writers
> > >  - What's next?
> > >  - What's the difference between &folio->page and page_folio(folio, 0)?
> > >  - What are we going to do about bio_vecs?
> > >  - How does all of this work with kmap()?
> > >
> > > I'm sure people would like to suggest other questions they have that
> > > aren't adequately answered already and might be of interest to a wider
> > > audience.
> > >
> >
> > Thanks for proposing this again, Matthew, I'd definitely like to be
> > involved in the discussion as I think a couple of my colleagues, cc'd,
> > would has well.  Memory efficiency is a top priority for 2024 and, thus,
> > getting on a pathway toward reducing the overhead of struct page is very
> > important for our hosts that are not using large amounts of 1GB hugetlb.
> >
> > I've seen your other thread regarding how the page allocator can be
> > enlightened for memdesc, so I'm hoping that can either be covered in this
> > topic or a separate topic.
>
> I'd like to keep this topic relevant to as many people as possible.
> I can add a proposal for a topic on both the PCP and Buddy allocators
> (I have a series of Thoughts on how the PCP allocator works in a memdesc
> world that I haven't written down & sent out yet).

Interesting.  Given that pcp areas are mostly allocated by kmalloc,
with vmalloc used for large allocations, how would a memdesc for them
differ from one for regular kmalloc allocations, given that they are
sub-page?

> Or we can cover the page allocators in your biweekly meetings.  Maybe both
> since not everybody can attend either the phone call or the conference.
>
> > Especially important for us would be the division of work so that we can
> > parallelize development as much as possible for things like memdesc.  If
> > there are any areas that just haven't been investigated yet but we *know*
> > we'll need to address to get to the new world of memdesc, I think we'd
> > love to discuss that.
>
> Thee's so much work to be done!  And it's mostly parallelisable and almost
> trivial.  It's just largely on the filesystem-page cache interaction, so
> it's not terribly interesting.  See, for example, the ext2, ext4, gfs2,
> nilfs2, ufs and ubifs patchsets I've done over the past few releases.
> I have about half of an ntfs3 patchset ready to send.

> There's a bunch of work to be done in DRM to switch from pages to folios
> due to their use of shmem.  You can also grep for 'page->mapping' (because
> fortunately we aren't too imaginative when it comes to naming variables)
> and find 270 places that need to be changed.  Some are comments, but
> those still need to be updated!
>
> Anything using lock_page(), get_page(), set_page_dirty(), using
> &folio->page, any of the functions in mm/folio-compat.c needs auditing.
> We can make the first three of those work, but they're good indicators
> that the code needs to be looked at.
>
> There is some interesting work to be done, and one of the things I'm
> thinking hard about right now is how we're doing folio conversions
> that make sense with today's code, and stop making sense when we get
> to memdescs.  That doesn't apply to anything interacting with the page
> cache (because those are folios now and in the future), but it does apply
> to one spot in ext4 where it allocates memory from slab and attaches a
> buffer_head to it ...

There are many more drivers that would need the conversion.  For
example, IOMMU page tables can occupy gigabytes of space and have
different implementations for AMD, x86, and several ARM platforms.
Converting them to memdescs and unifying the IO page table management
implementation across these platforms would be beneficial.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-21 21:00 ` David Rientjes
@ 2024-01-21 23:14   ` Matthew Wilcox
  2024-01-21 23:31     ` Pasha Tatashin
  2024-01-24 17:51     ` Christoph Lameter (Ampere)
  0 siblings, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-21 23:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pasha Tatashin, Sourav Panda, lsf-pc, linux-fsdevel, linux-mm,
	linux-block, linux-ide, linux-scsi, linux-nvme, bpf

On Sun, Jan 21, 2024 at 01:00:40PM -0800, David Rientjes wrote:
> On Fri, 19 Jan 2024, Matthew Wilcox wrote:
> > It's probably worth doing another roundup of where we are on our journey
> > to separating folios, slabs, pages, etc.  Something suitable for people
> > who aren't MM experts, and don't care about the details of how page
> > allocation works.  I can talk for hours about whatever people want to
> > hear about but some ideas from me:
> > 
> >  - Overview of how the conversion is going
> >  - Convenience functions for filesystem writers
> >  - What's next?
> >  - What's the difference between &folio->page and page_folio(folio, 0)?
> >  - What are we going to do about bio_vecs?
> >  - How does all of this work with kmap()?
> > 
> > I'm sure people would like to suggest other questions they have that
> > aren't adequately answered already and might be of interest to a wider
> > audience.
> > 
> 
> Thanks for proposing this again, Matthew, I'd definitely like to be 
> involved in the discussion as I think a couple of my colleagues, cc'd, 
> would has well.  Memory efficiency is a top priority for 2024 and, thus, 
> getting on a pathway toward reducing the overhead of struct page is very 
> important for our hosts that are not using large amounts of 1GB hugetlb.
> 
> I've seen your other thread regarding how the page allocator can be 
> enlightened for memdesc, so I'm hoping that can either be covered in this 
> topic or a separate topic.

I'd like to keep this topic relevant to as many people as possible.
I can add a proposal for a topic on both the PCP and Buddy allocators
(I have a series of Thoughts on how the PCP allocator works in a memdesc
world that I haven't written down & sent out yet).

Or we can cover the page allocators in your biweekly meetings.  Maybe both
since not everybody can attend either the phone call or the conference.

> Especially important for us would be the division of work so that we can 
> parallelize development as much as possible for things like memdesc.  If 
> there are any areas that just haven't been investigated yet but we *know* 
> we'll need to address to get to the new world of memdesc, I think we'd 
> love to discuss that.

There's so much work to be done!  And it's mostly parallelisable and almost
trivial.  It's just largely on the filesystem-page cache interaction, so
it's not terribly interesting.  See, for example, the ext2, ext4, gfs2,
nilfs2, ufs and ubifs patchsets I've done over the past few releases.
I have about half of an ntfs3 patchset ready to send.

There's a bunch of work to be done in DRM to switch from pages to folios
due to their use of shmem.  You can also grep for 'page->mapping' (because
fortunately we aren't too imaginative when it comes to naming variables)
and find 270 places that need to be changed.  Some are comments, but
those still need to be updated!

Anything using lock_page(), get_page(), set_page_dirty(), using
&folio->page, any of the functions in mm/folio-compat.c needs auditing.
We can make the first three of those work, but they're good indicators
that the code needs to be looked at.
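
To make that concrete, the mechanical part of such a conversion tends
to look something like this (a simplified sketch, not taken from any
particular driver or filesystem; the old_/new_ function names are made
up, but page_folio() and the folio_* calls are the real APIs):

/* Before: page-based */
static void old_mark_dirty(struct page *page)
{
	lock_page(page);
	set_page_dirty(page);
	unlock_page(page);
}

/* After: operate on the folio the page belongs to */
static void new_mark_dirty(struct page *page)
{
	struct folio *folio = page_folio(page);

	folio_lock(folio);
	folio_mark_dirty(folio);
	folio_unlock(folio);
}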

There is some interesting work to be done, and one of the things I'm
thinking hard about right now is how we're doing folio conversions
that make sense with today's code, and stop making sense when we get
to memdescs.  That doesn't apply to anything interacting with the page
cache (because those are folios now and in the future), but it does apply
to one spot in ext4 where it allocates memory from slab and attaches a
buffer_head to it ...


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-19 16:24 Matthew Wilcox
  2024-01-19 20:31 ` Keith Busch
  2024-01-20 14:11 ` Chuck Lever III
@ 2024-01-21 21:00 ` David Rientjes
  2024-01-21 23:14   ` Matthew Wilcox
  2024-01-27 10:10 ` Amir Goldstein
  2024-01-27 17:57 ` Kent Overstreet
  4 siblings, 1 reply; 26+ messages in thread
From: David Rientjes @ 2024-01-21 21:00 UTC (permalink / raw)
  To: Matthew Wilcox, Pasha Tatashin, Sourav Panda
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Fri, 19 Jan 2024, Matthew Wilcox wrote:

> It's probably worth doing another roundup of where we are on our journey
> to separating folios, slabs, pages, etc.  Something suitable for people
> who aren't MM experts, and don't care about the details of how page
> allocation works.  I can talk for hours about whatever people want to
> hear about but some ideas from me:
> 
>  - Overview of how the conversion is going
>  - Convenience functions for filesystem writers
>  - What's next?
>  - What's the difference between &folio->page and page_folio(folio, 0)?
>  - What are we going to do about bio_vecs?
>  - How does all of this work with kmap()?
> 
> I'm sure people would like to suggest other questions they have that
> aren't adequately answered already and might be of interest to a wider
> audience.
> 

Thanks for proposing this again, Matthew, I'd definitely like to be 
involved in the discussion as I think a couple of my colleagues, cc'd, 
would as well.  Memory efficiency is a top priority for 2024 and, thus, 
getting on a pathway toward reducing the overhead of struct page is very 
important for our hosts that are not using large amounts of 1GB hugetlb.

I've seen your other thread regarding how the page allocator can be 
enlightened for memdesc, so I'm hoping that can either be covered in this 
topic or a separate topic.

Especially important for us would be the division of work so that we can 
parallelize development as much as possible for things like memdesc.  If 
there are any areas that just haven't been investigated yet but we *know* 
we'll need to address to get to the new world of memdesc, I think we'd 
love to discuss that.


* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-19 16:24 Matthew Wilcox
  2024-01-19 20:31 ` Keith Busch
@ 2024-01-20 14:11 ` Chuck Lever III
  2024-01-21 21:00 ` David Rientjes
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Chuck Lever III @ 2024-01-20 14:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf



> On Jan 19, 2024, at 11:24 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> It's probably worth doing another roundup of where we are on our journey
> to separating folios, slabs, pages, etc.  Something suitable for people
> who aren't MM experts, and don't care about the details of how page
> allocation works.  I can talk for hours about whatever people want to
> hear about but some ideas from me:
> 
> - Overview of how the conversion is going
> - Convenience functions for filesystem writers
> - What's next?
> - What's the difference between &folio->page and page_folio(folio, 0)?
> - What are we going to do about bio_vecs?

Thanks for suggesting bio_vecs, one of my current favorite topics.

I'm still interested in any work going on for a bio_vec-enabled
API to the kernel's IOMMU infrastructure, which I've heard Leon R.
is working on.


> - How does all of this work with kmap()?
> 
> I'm sure people would like to suggest other questions they have that
> aren't adequately answered already and might be of interest to a wider
> audience.
> 

--
Chuck Lever




* Re: [LSF/MM/BPF TOPIC] State Of The Page
  2024-01-19 16:24 Matthew Wilcox
@ 2024-01-19 20:31 ` Keith Busch
  2024-01-20 14:11 ` Chuck Lever III
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Keith Busch @ 2024-01-19 20:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, linux-block, linux-ide,
	linux-scsi, linux-nvme, bpf

On Fri, Jan 19, 2024 at 04:24:29PM +0000, Matthew Wilcox wrote:
> It's probably worth doing another roundup of where we are on our journey
> to separating folios, slabs, pages, etc.  Something suitable for people
> who aren't MM experts, and don't care about the details of how page
> allocation works.  I can talk for hours about whatever people want to
> hear about but some ideas from me:
> 
>  - Overview of how the conversion is going
>  - Convenience functions for filesystem writers
>  - What's next?
>  - What's the difference between &folio->page and page_folio(folio, 0)?
>  - What are we going to do about bio_vecs?
>  - How does all of this work with kmap()?
> 
> I'm sure people would like to suggest other questions they have that
> aren't adequately answered already and might be of interest to a wider
> audience.

Thanks for suggesting this, I would like to attend your discussion. If
you have more recent phyr thoughts (possibly related to your bio_vecs
point?), or other tie-ins to large block size support, that would also
be great.


* [LSF/MM/BPF TOPIC] State Of The Page
@ 2024-01-19 16:24 Matthew Wilcox
  2024-01-19 20:31 ` Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Matthew Wilcox @ 2024-01-19 16:24 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, linux-mm, linux-block, linux-ide, linux-scsi,
	linux-nvme, bpf

It's probably worth doing another roundup of where we are on our journey
to separating folios, slabs, pages, etc.  Something suitable for people
who aren't MM experts, and don't care about the details of how page
allocation works.  I can talk for hours about whatever people want to
hear about but some ideas from me:

 - Overview of how the conversion is going
 - Convenience functions for filesystem writers
 - What's next?
 - What's the difference between &folio->page and folio_page(folio, 0)?
 - What are we going to do about bio_vecs?
 - How does all of this work with kmap()?

I'm sure people would like to suggest other questions they have that
aren't adequately answered already and might be of interest to a wider
audience.

