* [LSF/MM/BPF TOPIC] Single Owner Memory
From: Pasha Tatashin @ 2023-02-20 19:10 UTC
  To: lsf-pc; +Cc: linux-mm

Hello,

I would like to propose the following topic for this year's LSF/MM/BPF:

Single Owner Memory (som): A type of anonymous memory that is never shared.

Within Google, the vast majority of memory, over 90%, has a single
owner. This is because most of the jobs are not multi-process but
multi-threaded. Examples of single owner memory allocations are all
tcmalloc()/malloc() allocations, and mmap(MAP_ANONYMOUS | MAP_PRIVATE)
allocations without forks. On the other hand, the struct page metadata
that is kept for all types of memory takes 1.6% of system memory. It
would be reasonable to find ways to optimize memory such that the
common som case has a reduced amount of metadata.

This would be similar to HugeTLB and DAX that are treated as special
cases, and can release struct pages for the subpages back to the
system.

The proposal is to discuss a new som driver that would use HugeTLB as
a source of 2M chunks. When a user creates som memory, i.e.:

mmap(MAP_ANONYMOUS | MAP_PRIVATE);
madvise(mem, length, MADV_DONTFORK);

a VMA from the som driver is used instead of a regular anon VMA.
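
Spelled out with full arguments, the user-space side is roughly the
following sketch (error handling omitted; the kernel-side som VMA is
the part to be designed and discussed):

#include <stddef.h>
#include <sys/mman.h>

/* Anonymous, private, never-forked memory: the som case. */
static void *som_mmap(size_t length)
{
        void *mem = mmap(NULL, length, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        madvise(mem, length, MADV_DONTFORK);
        return mem;
}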

The discussion should include the following topics:
- Interaction with folio and the proposed struct page {memdesc}.
- Handling for migrate_pages() and friends.
- Handling for FOLL_PIN and FOLL_LONGTERM.
- What type of madvise() properties the som memory should handle.

Thanks,
Pasha



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Matthew Wilcox @ 2023-02-21 13:46 UTC
  To: Pasha Tatashin; +Cc: lsf-pc, linux-mm

On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> Within Google the vast majority of memory, over 90% has a single
> owner. This is because most of the jobs are not multi-process but
> instead multi-threaded. The examples of single owner memory
> allocations are all tcmalloc()/malloc() allocations, and
> mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On the
> other hand, the struct page metadata that is shared for all types of
> memory takes 1.6% of the system memory. It would be reasonable to find
> ways to optimize memory such that the common som case has a reduced
> amount of metadata.
> 
> This would be similar to HugeTLB and DAX that are treated as special
> cases, and can release struct pages for the subpages back to the
> system.

DAX can't, unless something's changed recently.  You're referring to
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP.

> The proposal is to discuss a new som driver that would use HugeTLB as
> a source of 2M chunks. When user creates a som memory, i.e.:
> 
> mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> madvise(mem, length, MADV_DONTFORK);
> 
> A vma from the som driver is used instead of regular anon vma.

That's going to be "interesting".  The VMA is already created with
the call to mmap(), and madvise has not traditionally allowed drivers
to replace a VMA.  You might be better off creating a /dev/som and
hacking the malloc libraries to pass an fd from that instead of passing
MAP_ANONYMOUS.
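
Something along these lines, purely a sketch: no such device exists
today, and the details (for instance whether the driver wants
MAP_SHARED or MAP_PRIVATE semantics) would be up to the som driver.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* The malloc library gets its backing store from the som driver
 * instead of passing MAP_ANONYMOUS. */
static void *som_alloc(size_t len)
{
        int fd = open("/dev/som", O_RDWR | O_CLOEXEC);
        void *mem;

        if (fd < 0)
                return MAP_FAILED;
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping keeps the device alive */
        return mem;
}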

> The discussion should include the following topics:
> -  Interaction with folio and the proposed struct page {memdesc}.
> - Handling for migrate_pages() and friends.
> - Handling for FOLL_PIN and FOLL_LONGTERM.
> - What type of madvise() properties the som memory should handle

Obviously once we get to dynamically allocated memdescs, this whole
thing goes away, so I'm not excited about making big changes to the
kernel to support this.

The savings you'll see are 6 pages (24kB) per 2MB allocated (1.2%).
That's not nothing, but it's not huge either.
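
For reference, the arithmetic behind those numbers (assuming 4kB base
pages, 64-byte struct pages, and 6 of the 8 vmemmap pages per 2MB
being released, per the figure above):

#include <assert.h>

int main(void)
{
        unsigned long chunk = 2UL << 20;           /* one 2MB som chunk      */
        unsigned long meta  = (chunk / 4096) * 64; /* 512 struct pages, 32kB */
        unsigned long freed = 6 * 4096;            /* 6 vmemmap pages freed  */

        assert(meta == 8 * 4096);                  /* 8 vmemmap pages total  */
        assert(freed * 10000 / chunk == 117);      /* ~1.2% of the 2MB chunk */
        return 0;
}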



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Pasha Tatashin @ 2023-02-21 14:37 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm

Hey Matthew,

Thank you for looking into this.

On Tue, Feb 21, 2023 at 8:46 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > Within Google the vast majority of memory, over 90% has a single
> > owner. This is because most of the jobs are not multi-process but
> > instead multi-threaded. The examples of single owner memory
> > allocations are all tcmalloc()/malloc() allocations, and
> > mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On the
> > other hand, the struct page metadata that is shared for all types of
> > memory takes 1.6% of the system memory. It would be reasonable to find
> > ways to optimize memory such that the common som case has a reduced
> > amount of metadata.
> >
> > This would be similar to HugeTLB and DAX that are treated as special
> > cases, and can release struct pages for the subpages back to the
> > system.
>
> DAX can't, unless something's changed recently.  You're referring to
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP

DAX has a similar optimization:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.2&id=e3246d8f52173a798710314a42fea83223036fc8

>
> > The proposal is to discuss a new som driver that would use HugeTLB as
> > a source of 2M chunks. When user creates a som memory, i.e.:
> >
> > mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> > madvise(mem, length, MADV_DONTFORK);
> >
> > A vma from the som driver is used instead of regular anon vma.
>
> That's going to be "interesting".  The VMA is already created with
> the call to mmap(), and madvise has not traditionally allowed drivers
> to replace a VMA.  You might be better off creating a /dev/som and
> hacking the malloc libraries to pass an fd from that instead of passing
> MAP_ANONYMOUS.

I do not plan to replace the VMA after madvise(); the syscall sequence
above only illustrates how Single Owner Memory can be enforced today.
However, in the future we would either need to add another mmap() flag
for single owner memory, if that proves to be important, or, as you
suggested, use ioctl() through /dev/som.

> > The discussion should include the following topics:
> > -  Interaction with folio and the proposed struct page {memdesc}.
> > - Handling for migrate_pages() and friends.
> > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > - What type of madvise() properties the som memory should handle
>
> Obviously once we get to dynamically allocated memdescs, this whole
> thing goes away, so I'm not excited about making big changes to the
> kernel to support this.

This is why the changes I am thinking about are going to be mostly
localized in a separate driver and will not alter the core mm much.
However, even with memdescs, Single Owner Memory is not singled out
from the other memory types (shared, anon, named), so I do not expect
that memdescs can provide savings or optimizations for this specific
use case.

> The savings you'll see are 6 pages (24kB) per 2MB allocated (1.2%).
> That's not nothing, but it's not huge either.

This depends on the scale; in our fleet, 1.2% savings are huge.

Pasha



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Matthew Wilcox @ 2023-02-21 15:05 UTC
  To: Pasha Tatashin; +Cc: lsf-pc, linux-mm

On Tue, Feb 21, 2023 at 09:37:17AM -0500, Pasha Tatashin wrote:
> Hey Matthew,
> 
> Thank you for looking into this.
> 
> On Tue, Feb 21, 2023 at 8:46 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > > Within Google the vast majority of memory, over 90% has a single
> > > owner. This is because most of the jobs are not multi-process but
> > > instead multi-threaded. The examples of single owner memory
> > > allocations are all tcmalloc()/malloc() allocations, and
> > > mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On the
> > > other hand, the struct page metadata that is shared for all types of
> > > memory takes 1.6% of the system memory. It would be reasonable to find
> > > ways to optimize memory such that the common som case has a reduced
> > > amount of metadata.
> > >
> > > This would be similar to HugeTLB and DAX that are treated as special
> > > cases, and can release struct pages for the subpages back to the
> > > system.
> >
> > DAX can't, unless something's changed recently.  You're referring to
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> 
> DAX has a similar optimization:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.2&id=e3246d8f52173a798710314a42fea83223036fc8

Oh, devdax, not fsdax.

> > > The proposal is to discuss a new som driver that would use HugeTLB as
> > > a source of 2M chunks. When user creates a som memory, i.e.:
> > >
> > > mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> > > madvise(mem, length, MADV_DONTFORK);
> > >
> > > A vma from the som driver is used instead of regular anon vma.
> >
> > That's going to be "interesting".  The VMA is already created with
> > the call to mmap(), and madvise has not traditionally allowed drivers
> > to replace a VMA.  You might be better off creating a /dev/som and
> > hacking the malloc libraries to pass an fd from that instead of passing
> > MAP_ANONYMOUS.
> 
> I do not plan to replace VMA after madvise(), I showed the syscall
> sequence to show how Single Owner Memory can be enforced today.
> However, in the future we either need to add another mmap() flag for
> single owner memory if that is proved to be important or as you
> suggested  use ioctl() through /dev/som.

Not ioctl().  Pass an fd from /dev/som to mmap and have the som driver
set up the VMA.

> > > The discussion should include the following topics:
> > > -  Interaction with folio and the proposed struct page {memdesc}.
> > > - Handling for migrate_pages() and friends.
> > > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > > - What type of madvise() properties the som memory should handle
> >
> > Obviously once we get to dynamically allocated memdescs, this whole
> > thing goes away, so I'm not excited about making big changes to the
> > kernel to support this.
> 
> This is why the changes that I am thinking about are going to be
> mostly localized in a separate driver and do not alter the core mm
> much. However, even with memdesc, today the Single Owner Memory is not
> singled out from the rest of memory types (shared, anon, named), so I
> do not expect the memdescs can provide saving or optimizations for
> this specific use case.

With memdescs, let's suppose the malloc library asks for a 256kB
allocation.  You end up using 8 bytes per page for the memdesc pointer
(512 bytes) plus around 96 bytes for the folio that's used by the anon
memory (assuming appropriate hinting / heuristics that says "Hey, treat
this as a single allocation").  So that's 608 bytes of overhead for a
256kB allocation, or 0.23% overhead.  About half the overhead of 8kB
per 2MB (plus whatever overhead the SOM driver has to track the 256kB
of memory).
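
Written out, with the 8-byte memdesc pointer and ~96-byte folio above
as assumptions rather than settled numbers:

#include <assert.h>

int main(void)
{
        unsigned long alloc    = 256UL * 1024;             /* 256kB chunk  */
        unsigned long overhead = (alloc / 4096) * 8 + 96;  /* ptrs + folio */

        assert(overhead == 608);
        assert(overhead * 10000 / alloc == 23);    /* ~0.23% metadata */
        return 0;
}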

If 256kB isn't the right size to be doing this kind of analysis on, we
can rerun it on whatever size you want.  I'm not really familiar with
what userspace is doing these days.

> > The savings you'll see are 6 pages (24kB) per 2MB allocated (1.2%).
> > That's not nothing, but it's not huge either.
> 
> This depends on the scale, in our fleet 1.2% savings are huge.

Then 1.4% will be better, yes?  ;-)



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Matthew Wilcox @ 2023-02-21 15:55 UTC
  To: Pasha Tatashin; +Cc: lsf-pc, linux-mm

On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> The discussion should include the following topics:
> -  Interaction with folio and the proposed struct page {memdesc}.
> - Handling for migrate_pages() and friends.
> - Handling for FOLL_PIN and FOLL_LONGTERM.
> - What type of madvise() properties the som memory should handle

Something I didn't see covered was how you'd want to handle memory
pressure.  The answer for memdescs is that we'd treat each userspace
allocation as a single object; if you allocate a 256kB folio, that has
one accessed bit (set every time any of the PTEs which reference that
folio is accessed), one dirty bit, is aged on the LRU as a single unit
and will be written to swap as a single unit.

Assuming we're dealing with objects smaller than PMDs, we have a number
of PTEs each of which has its own A and D bits, so we can determine
at each revolution of the LRU clock whether it still makes sense to be
treating the folio as a single unit, or whether pages in the first half
of the folio are no longer being accessed and we should split the folio
in half and age the two halves separately.
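
A userspace model of that decision, with the PTE bits simulated (none
of this is existing kernel API; it only makes the heuristic concrete):

#include <stdbool.h>
#include <stddef.h>

#define PTE_YOUNG 0x1   /* simulated per-PTE accessed bit */

/* Split if exactly one half of the folio went cold since the last pass. */
static bool folio_should_split(const unsigned char *pte_flags, size_t nr)
{
        size_t half = nr / 2, young_lo = 0, young_hi = 0;

        for (size_t i = 0; i < half; i++)
                young_lo += pte_flags[i] & PTE_YOUNG;
        for (size_t i = half; i < nr; i++)
                young_hi += pte_flags[i] & PTE_YOUNG;

        return (young_lo == 0) != (young_hi == 0);
}

int main(void)
{
        /* 8-page folio: first half idle, second half still being touched. */
        unsigned char ptes[8] = { 0, 0, 0, 0, 1, 1, 0, 1 };

        return folio_should_split(ptes, 8) ? 0 : 1;
}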

All of that is still theoretical; we don't allocate anon memory in sizes
other than PAGE_SIZE and PMD size.  And we don't track page cache A and
D bits to see whether the decision to allocate a particular page size was
the right one (most page cache memory is never mapped into userspace, so
it might be of limited value, but I'm sure we could track the equivalent
information with read() and write()).



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Pasha Tatashin @ 2023-02-21 17:16 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm

On Tue, Feb 21, 2023 at 10:05 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 21, 2023 at 09:37:17AM -0500, Pasha Tatashin wrote:
> > Hey Matthew,
> >
> > Thank you for looking into this.
> >
> > On Tue, Feb 21, 2023 at 8:46 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > > > Within Google the vast majority of memory, over 90% has a single
> > > > owner. This is because most of the jobs are not multi-process but
> > > > instead multi-threaded. The examples of single owner memory
> > > > allocations are all tcmalloc()/malloc() allocations, and
> > > > mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On the
> > > > other hand, the struct page metadata that is shared for all types of
> > > > memory takes 1.6% of the system memory. It would be reasonable to find
> > > > ways to optimize memory such that the common som case has a reduced
> > > > amount of metadata.
> > > >
> > > > This would be similar to HugeTLB and DAX that are treated as special
> > > > cases, and can release struct pages for the subpages back to the
> > > > system.
> > >
> > > DAX can't, unless something's changed recently.  You're referring to
> > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >
> > DAX has a similar optimization:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.2&id=e3246d8f52173a798710314a42fea83223036fc8
>
> Oh, devdax, not fsdax.
>
> > > > The proposal is to discuss a new som driver that would use HugeTLB as
> > > > a source of 2M chunks. When user creates a som memory, i.e.:
> > > >
> > > > mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> > > > madvise(mem, length, MADV_DONTFORK);
> > > >
> > > > A vma from the som driver is used instead of regular anon vma.
> > >
> > > That's going to be "interesting".  The VMA is already created with
> > > the call to mmap(), and madvise has not traditionally allowed drivers
> > > to replace a VMA.  You might be better off creating a /dev/som and
> > > hacking the malloc libraries to pass an fd from that instead of passing
> > > MAP_ANONYMOUS.
> >
> > I do not plan to replace VMA after madvise(), I showed the syscall
> > sequence to show how Single Owner Memory can be enforced today.
> > However, in the future we either need to add another mmap() flag for
> > single owner memory if that is proved to be important or as you
> > suggested  use ioctl() through /dev/som.
>
> Not ioctl().  Pass an fd from /dev/som to mmap and have the som driver
> set up the VMA.

Good point; using an fd is indeed better, and it can be made
accessible to more users without changes.

>
> > > > The discussion should include the following topics:
> > > > -  Interaction with folio and the proposed struct page {memdesc}.
> > > > - Handling for migrate_pages() and friends.
> > > > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > > > - What type of madvise() properties the som memory should handle
> > >
> > > Obviously once we get to dynamically allocated memdescs, this whole
> > > thing goes away, so I'm not excited about making big changes to the
> > > kernel to support this.
> >
> > This is why the changes that I am thinking about are going to be
> > mostly localized in a separate driver and do not alter the core mm
> > much. However, even with memdesc, today the Single Owner Memory is not
> > singled out from the rest of memory types (shared, anon, named), so I
> > do not expect the memdescs can provide saving or optimizations for
> > this specific use case.
>
> With memdescs, let's suppose the malloc library asks for a 256kB
> allocation.  You end up using 8 bytes per page for the memdesc pointer
> (512 bytes) plus around 96 bytes for the folio that's used by the anon
> memory (assuming appropriate hinting / heuristics that says "Hey, treat
> this as a single allocation").

Also, the 256kB should be physically contiguous, right? Hopefully,
fragmentation is not going to be an issue, but we might need to look
into stronger page migration enforcement in order to reduce
fragmentation during allocations, and thus reduce the memory overhead.
Today, fragmentation can potentially reduce performance when THPs are
not available, but in the future with memdescs fragmentation might
also affect the memory overhead. We might need to look into changing
some of the migration policies.

>  So that's 608 bytes of overhead for a
> 256kB allocation, or 0.23% overhead.  About half the overhead of 8kB
> per 2MB (plus whatever overhead the SOM driver has to track the 256kB
> of memory).

I like the idea of memdescs, and would like to stay involved in the
project development.  The potential memory savings are indeed
substantial.

> If 256kB isn't the right size to be doing this kind of analysis on, we
> can rerun it on whatever size you want.  I'm not really familiar with
> what userspace is doing these days.
>
> > > The savings you'll see are 6 pages (24kB) per 2MB allocated (1.2%).
> > > That's not nothing, but it's not huge either.
> >
> > This depends on the scale, in our fleet 1.2% savings are huge.
>
> Then 1.4% will be better, yes?  ;-)

Absolutely, 1.4% is even better. I mean, 0% kernel memory overhead
would be just about perfect :-)

Let me give a few more reasons why /dev/som can be helpful:

1. Independent memory pool.
While /dev/som itself always manages memory in 2M chunks, it can be
configured to use memory from HugeTLB (2M or 1G), devdax, or
kernel-external memory (i.e. memory that is not part of System RAM).

2. Low overhead.
/dev/som will allocate memory from the pool in 1G chunks and manage it
in 2M chunks. This allows low-overhead management via bitmaps (see the
sketch after this list). A list/tree of 2M chunks is kept per user
process, from which faults on som VMAs are handled.

3. All pages are migratable.
Since som manages only user pages, all of its pages are required to be
migratable. In order to support FOLL_LONGTERM we will need to decide
whether to migrate the page so that it becomes a normal page (i.e.
core-MM managed), or to add a separate pool of long-term pinned pages.
Even in today's kernel, when we FOLL_LONGTERM a page it is migrated
out of ZONE_MOVABLE.

4. 1G anonymous pages.
Since all pages are migratable, support for 1G anonymous pages can be
implemented. Unlike core-MM THPs, which do not have the struct page
optimization, som 4K, 2M, and 1G pages will all have reasonable
overhead from the beginning.

5. Performance benefit when running /dev/som in a virtual machine.
With extended page tables, the translation cost in terms of the number
of loads is not a simple sum of the native and extended page table
walks, but (n * m + n + m), where n is the number of page table levels
in the guest and m is the number of levels in the extended page table
(e.g. 24 loads for n = m = 4, instead of 4 for a native walk). This is
because the guest page table levels themselves must be translated into
host physical addresses using the extended page tables.

Since /dev/som allows 1G anonymous pages, we can use guest physical
memory as virtual memory: only a subset of a 1G page in the guest is
actually backed by physical pages on the host, yet access to that
subset is substantially faster due to fewer page table loads and a
lower TLB miss rate. I am proposing a separate talk about this and
other VM optimizations:
https://lore.kernel.org/linux-mm/CA+CK2bDr5Xii021JBXeyCEY4jjWCsZQ=ENa-s8MLkBv5hYUvsA@mail.gmail.com/

6. Security.
There is a reduced risk of falsely shared pages because pages are
enforced to be single owner. This can help avoid some of the bugs we
have seen in the past with refcount errors, for which I wrote
page_table_check, which has since caught a few false-sharing issues.
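
As a minimal sketch of the bitmap management mentioned in (2): the
names and layout are made up for illustration, not a proposed
interface, and the metadata cost works out to 64 bytes of bitmap per
1G of pool.

#include <stdint.h>

#define POOL_SZ   (1ULL << 30)          /* 1G pool from HugeTLB/devdax */
#define CHUNK_SZ  (2ULL << 20)          /* managed in 2M chunks        */
#define NR_CHUNKS (POOL_SZ / CHUNK_SZ)  /* 512 chunks -> 64B of bitmap */

struct som_pool {
        uint64_t base;                   /* base address of the 1G pool */
        uint64_t bitmap[NR_CHUNKS / 64]; /* one bit per 2M chunk        */
};

/* Return the offset of a free 2M chunk, or -1 if the pool is exhausted. */
static int64_t som_chunk_alloc(struct som_pool *p)
{
        for (uint64_t i = 0; i < NR_CHUNKS; i++) {
                if (!(p->bitmap[i / 64] & (1ULL << (i % 64)))) {
                        p->bitmap[i / 64] |= 1ULL << (i % 64);
                        return (int64_t)(i * CHUNK_SZ);
                }
        }
        return -1;
}

static void som_chunk_free(struct som_pool *p, uint64_t offset)
{
        uint64_t i = offset / CHUNK_SZ;

        p->bitmap[i / 64] &= ~(1ULL << (i % 64));
}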



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Pasha Tatashin @ 2023-02-21 17:20 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm

On Tue, Feb 21, 2023 at 10:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > The discussion should include the following topics:
> > -  Interaction with folio and the proposed struct page {memdesc}.
> > - Handling for migrate_pages() and friends.
> > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > - What type of madvise() properties the som memory should handle
>
> Something I didn't see covered was how you'd want to handle memory
> pressure.  The answer for memdescs is that we'd treat each userspace

Indeed, this is something that should be covered. I have had a few
thoughts about it, but it needs more work.
Some possibilities:

1. When memory is under pressure, we can migrate pages to normal
memory, which would enable that memory to become swappable, etc.
2. Teach in-memory compression such as zswap/zram to work directly
with /dev/som.

> allocation as a single object; if you allocate a 256kB folio, that has
> one accessed bit (set every time any of the PTEs which reference that
> folio is accessed), one dirty bit, is aged on the LRU as a single unit
> and will be written to swap as a single unit.
>
> Assuming we're dealing with objects smaller than PMDs, we have a number
> of PTEs each of which has its own A and D bits, so we can determine
> at each revolution of the LRU clock whether it still makes sense to be
> treating the folio as a single unit, or whether pages in the first half
> of the folio are no longer being accessed and we should split the folio
> in half and age the two halves separately.

Interesting.

>
> All of that is still theoretical; we don't allocate anon memory in sizes
> other than PAGE_SIZE and PMD size.  And we don't track page cache A and
> D bits to see whether the decision to allocate a particular page size was
> the right one (most page cache memory is never mapped into userspace, so
> it might be of limited value, but I'm sure we could track the equivalent
> information with read() and write()).

Thanks,
Pasha



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Matthew Wilcox @ 2023-02-22 16:18 UTC
  To: Pasha Tatashin; +Cc: lsf-pc, linux-mm

On Tue, Feb 21, 2023 at 12:16:27PM -0500, Pasha Tatashin wrote:
> > > > Obviously once we get to dynamically allocated memdescs, this whole
> > > > thing goes away, so I'm not excited about making big changes to the
> > > > kernel to support this.
> > >
> > > This is why the changes that I am thinking about are going to be
> > > mostly localized in a separate driver and do not alter the core mm
> > > much. However, even with memdesc, today the Single Owner Memory is not
> > > singled out from the rest of memory types (shared, anon, named), so I
> > > do not expect the memdescs can provide saving or optimizations for
> > > this specific use case.
> >
> > With memdescs, let's suppose the malloc library asks for a 256kB
> > allocation.  You end up using 8 bytes per page for the memdesc pointer
> > (512 bytes) plus around 96 bytes for the folio that's used by the anon
> > memory (assuming appropriate hinting / heuristics that says "Hey, treat
> > this as a single allocation").
> 
> Also, the 256kB should be physically contiguous, right? Hopefully,
> fragmentation is not going to be an issue, but we might need to look
> into increasing the page migration enforcements in order to reduce
> fragmentations  during allocs, and thus reduce the memory overheads.
> Today, fragmentation can potentially reduce the performance when THPs
> are not available but in the future with memdescs the fragmentation
> might also effect the memory overhead. We might need to look into
> changing some of the migration policies.

Yes, folios are always physically, virtually and logically contiguous,
and aligned, just like compound pages are today.  No plans to change that.

With more parts of the kernel using larger allocations, larger allocations
are easier to come by.  Clean pagecache is the easiest type of memory
to reclaim, and if the filesystem is using 64kB allocations instead of
4kB allocations, finding a contiguous 256kB only needs four consecutive
allocations to be freed rather than 64.  And if the page cache is trying
to allocate large contiguous amounts of memory, it's going to be kicking
kcompactd to make those happen more often.

We're never going to beat fragmentation all the time, but when we
lose to it, it just means that we end up allocating smaller folios,
not failing entirely.

> 1. Independent memory pool.
> While /dev/som itself always manages memory in 2M chunks it can be
> configured to use memory from HugeTLB (2M or 1G), devdax, or kernel
> external memory (i.e. memory that is not part of System RAM).

I'm not certain that's a good idea.  That memory isn't part of system
ram for a reason; maybe it's worse performance (or it's being saved
for better performance).  Handing it out to random users is going to
give unexpected performance problems.

> 2. Low overhead
> /dev/som will allocate memory from the pool in 1G chunks, and manage
> it in 2M chunks. This will allow low memory overhead management via
> bitmaps. List/tree of 2M chunks are going to be per user process, from
> where the faults on som vmas are going to be handled.

I'm not sure that it's going to be lower overhead than memdescs.
I have no idea what your data structures are going to be, so I can't do
an estimate.  I should warn you that I have a version of memdescs in mind
that has even more memory savings than the initial version (one pointer
per allocation instead of one pointer per page), so there's definitely
room for improvement.

One benefit that you don't mention is that /dev/som can almost certainly
be implemented faster than memdescs.  But memdescs are going to give you
better savings, so it's really going to be up to you which you want to
work on.

If it's a Google special that you keep internal, then I don't mind.  I'm
not sure whether we would still want to support the /dev/som interface
upstream after memdescs lands.  Maybe not; there always has to be a
userspace fallback to a som-less approach for older kernels, so
perhaps it can just be deleted after memdescs lands.



* Re: [LSF/MM/BPF TOPIC] Single Owner Memory
From: Pasha Tatashin @ 2023-02-22 16:40 UTC
  To: Matthew Wilcox; +Cc: lsf-pc, linux-mm

On Wed, Feb 22, 2023 at 11:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 21, 2023 at 12:16:27PM -0500, Pasha Tatashin wrote:
> > > > > Obviously once we get to dynamically allocated memdescs, this whole
> > > > > thing goes away, so I'm not excited about making big changes to the
> > > > > kernel to support this.
> > > >
> > > > This is why the changes that I am thinking about are going to be
> > > > mostly localized in a separate driver and do not alter the core mm
> > > > much. However, even with memdesc, today the Single Owner Memory is not
> > > > singled out from the rest of memory types (shared, anon, named), so I
> > > > do not expect the memdescs can provide saving or optimizations for
> > > > this specific use case.
> > >
> > > With memdescs, let's suppose the malloc library asks for a 256kB
> > > allocation.  You end up using 8 bytes per page for the memdesc pointer
> > > (512 bytes) plus around 96 bytes for the folio that's used by the anon
> > > memory (assuming appropriate hinting / heuristics that says "Hey, treat
> > > this as a single allocation").
> >
> > Also, the 256kB should be physically contiguous, right? Hopefully,
> > fragmentation is not going to be an issue, but we might need to look
> > into increasing the page migration enforcements in order to reduce
> > fragmentations  during allocs, and thus reduce the memory overheads.
> > Today, fragmentation can potentially reduce the performance when THPs
> > are not available but in the future with memdescs the fragmentation
> > might also effect the memory overhead. We might need to look into
> > changing some of the migration policies.
>
> Yes, folios are always physically, virtually and logically contiguous,
> and aligned, just like compound pages are today.  No plans to change that.
>
> With more parts of the kernel using larger allocations, larger allocations
> are easier to come by.  Clean pagecache is the easiest type of memory
> to reclaim, and if the filesystem is using 64kB allocations instead of
> 4kB allocations, finding a contiguous 256kB only needs four consecutive
> allocations to be freed rather than 64.  And if the page cache is trying
> to allocate large contiguous amounts of memory, it's going to be kicking
> kcompactd to make those happen more often.
>
> We're never going to beat fragmentation all the time, but when we
> lose to it, it just means that we end up allocating smaller folios,
> not failing entirely.
>
> > 1. Independent memory pool.
> > While /dev/som itself always manages memory in 2M chunks it can be
> > configured to use memory from HugeTLB (2M or 1G), devdax, or kernel
> > external memory (i.e. memory that is not part of System RAM).
>
> I'm not certain that's a good idea.  That memory isn't part of system
> ram for a reason; maybe it's worse performance (or it's being saved
> for better performance).  Handing it out to random users is going to
> give unexpected performance problems.

Even today such memory can be given to users via hot-plug: convert
pmem into devdax and hot-plug that as system memory. However, that is
not ideal for several reasons: struct page overhead, it mixes this
memory with the rest of the pages in the system, and there is no easy
way to enforce latency policies, etc.

Also, kernel-external memory does not have to be different from
regular RAM; it can be regular memory where the memmap kernel
parameter (e.g. the memmap=<size>$<address> form) was used to remove
most of the memory from kernel management for faster booting and
lower overhead. /dev/som can use such memory.

>
> > 2. Low overhead
> > /dev/som will allocate memory from the pool in 1G chunks, and manage
> > it in 2M chunks. This will allow low memory overhead management via
> > bitmaps. List/tree of 2M chunks are going to be per user process, from
> > where the faults on som vmas are going to be handled.
>
> I'm not sure that it's going to be lower overhead than memdescs.
> I have no idea what your data structures are going to be, so I can't do
> an estimate.  I should warn you that I have a version of memdescs in mind
> that has even more memory savings than the initial version (one pointer
> per allocation instead of one pointer per page), so there's definitely
> room for improvement.
>
> One benefit that you don't mention is that /dev/som can almost certainly
> be implemented faster than memdescs.  But memdescs are going to give you
> better savings, so it's really going to be up to you which you want to
> work on.

That is right, but I specifically did not mention this benefit because
I actually want /dev/som to be compatible with memdescs in the long
run as well. Would you like to chat about the other potential memory
savings that you have in mind for memdescs during LSF/MM, perhaps in a
brainstorming session?

>
> If it's a Google special that you keep internal, then I don't mind.  I'm
> not sure whether we would still want to support the /dev/som interface
> upstream after memdescs lands.  Maybe not; there always has to be a
> userspace fallback to a som-less approach for older kernels, so
> perhaps it can just be deleted after memdescs lands.

I have several benefits of /dev/som envisioned for a virtualized
environment that I am not sure how to achieve without it. We might
need it even after memdescs are fully implemented.

Thanks,
Pasha

