* [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
@ 2024-02-28 22:56 Khalid Aziz
2024-02-29 9:21 ` David Hildenbrand
2024-05-14 18:21 ` Christoph Lameter (Ampere)
0 siblings, 2 replies; 8+ messages in thread
From: Khalid Aziz @ 2024-02-28 22:56 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm
Threads of a process share an address space and page tables, which
allows for two key advantages:

1. The amount of memory required for PTEs to map physical pages stays
low even when a large number of threads share the same pages, since the
PTEs are shared across threads.

2. Page protection attributes are shared across threads, and a change
of attributes applies immediately to every thread without any overhead
of coordinating protection-bit changes across threads.

These advantages no longer apply when unrelated processes share pages.
Large database applications can easily comprise thousands of processes
that share hundreds of GB of pages. In cases like this, the amount of
memory consumed by page tables can exceed the size of the actual shared
data. On a database server with a 300GB SGA, a system crash was seen
with an out-of-memory condition when 1500+ clients tried to share this
SGA, even though the system had 512GB of memory. On this server, the
worst-case scenario of all 1500 processes mapping every page of the SGA
would have required 878GB+ for just the PTEs.
I have sent proposals and patches to solve this problem by adding a
mechanism to the kernel that processes can use to opt into sharing
page tables with other processes. We have had discussions on the
original proposal and subsequent refinements, but we have not converged
on a solution. As systems with multi-TB memory and in-memory databases
become more and more common, this is becoming a significant issue. An
interactive discussion can help us reach a consensus on how to solve
this.
Thanks,
Khalid
References:
https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/4082bc40-a99a-4b54-91e5-a1b55828d202@oracle.com/
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-28 22:56 [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare) Khalid Aziz
@ 2024-02-29 9:21 ` David Hildenbrand
2024-02-29 14:12 ` Matthew Wilcox
2024-03-04 16:45 ` Khalid Aziz
2024-05-14 18:21 ` Christoph Lameter (Ampere)
1 sibling, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-02-29 9:21 UTC (permalink / raw)
To: Khalid Aziz, lsf-pc; +Cc: linux-mm
On 28.02.24 23:56, Khalid Aziz wrote:
> Threads of a process share an address space and page tables, which
> allows for two key advantages:
>
> 1. The amount of memory required for PTEs to map physical pages stays
> low even when a large number of threads share the same pages, since
> the PTEs are shared across threads.
>
> 2. Page protection attributes are shared across threads, and a change
> of attributes applies immediately to every thread without any overhead
> of coordinating protection-bit changes across threads.
>
> These advantages no longer apply when unrelated processes share pages.
> Large database applications can easily comprise thousands of processes
> that share hundreds of GB of pages. In cases like this, the amount of
> memory consumed by page tables can exceed the size of the actual
> shared data. On a database server with a 300GB SGA, a system crash was
> seen with an out-of-memory condition when 1500+ clients tried to share
> this SGA, even though the system had 512GB of memory. On this server,
> the worst-case scenario of all 1500 processes mapping every page of
> the SGA would have required 878GB+ for just the PTEs.
>
> I have sent proposals and patches to solve this problem by adding a
> mechanism to the kernel that processes can use to opt into sharing
> page tables with other processes. We have had discussions on the
> original proposal and subsequent refinements, but we have not
> converged on a solution. As systems with multi-TB memory and in-memory
> databases become more and more common, this is becoming a significant
> issue. An interactive discussion can help us reach a consensus on how
> to solve this.
Hi,
I was hoping for a follow-up to my previous comments from ~4 months ago
[1], so one problem of "not converging" might be "no follow-up discussion".
Ideally, this session would not focus on mshare as previously discussed
at LSF/MM, but take a step back and discuss requirements and possible
adjustments to the original concept to get something possibly cleaner.
For example, I raised some ideas for avoiding having to re-route
mprotect()/mmap() calls. At least discussing somewhere why they are all
bad would be helpful ;)
[1]
https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 9:21 ` David Hildenbrand
@ 2024-02-29 14:12 ` Matthew Wilcox
2024-02-29 15:15 ` David Hildenbrand
2024-03-04 16:45 ` Khalid Aziz
1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2024-02-29 14:12 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Khalid Aziz, lsf-pc, linux-mm
On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
> On 28.02.24 23:56, Khalid Aziz wrote:
> > Threads of a process share an address space and page tables, which
> > allows for two key advantages:
> >
> > 1. The amount of memory required for PTEs to map physical pages stays
> > low even when a large number of threads share the same pages, since
> > the PTEs are shared across threads.
> >
> > 2. Page protection attributes are shared across threads, and a change
> > of attributes applies immediately to every thread without any overhead
> > of coordinating protection-bit changes across threads.
> >
> > These advantages no longer apply when unrelated processes share pages.
> > Large database applications can easily comprise thousands of processes
> > that share hundreds of GB of pages. In cases like this, the amount of
> > memory consumed by page tables can exceed the size of the actual
> > shared data. On a database server with a 300GB SGA, a system crash was
> > seen with an out-of-memory condition when 1500+ clients tried to share
> > this SGA, even though the system had 512GB of memory. On this server,
> > the worst-case scenario of all 1500 processes mapping every page of
> > the SGA would have required 878GB+ for just the PTEs.
> >
> > I have sent proposals and patches to solve this problem by adding a
> > mechanism to the kernel that processes can use to opt into sharing
> > page tables with other processes. We have had discussions on the
> > original proposal and subsequent refinements, but we have not
> > converged on a solution. As systems with multi-TB memory and in-memory
> > databases become more and more common, this is becoming a significant
> > issue. An interactive discussion can help us reach a consensus on how
> > to solve this.
>
> Hi,
>
> I was hoping for a follow-up to my previous comments from ~4 months ago [1],
> so one problem of "not converging" might be "no follow-up discussion".
>
> Ideally, this session would not focus on mshare as previously discussed at
> LSF/MM, but take a step back and discuss requirements and possible
> adjustments to the original concept to get something possibly cleaner.
I think the concept is clean. Your concept doesn't fit our use case!
So essentially what you're asking for is for us to do a lot of work
which doesn't solve our problem. You can imagine our lack of enthusiasm
for this.
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 14:12 ` Matthew Wilcox
@ 2024-02-29 15:15 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-02-29 15:15 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Khalid Aziz, lsf-pc, linux-mm
On 29.02.24 15:12, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
>> On 28.02.24 23:56, Khalid Aziz wrote:
>>> Threads of a process share an address space and page tables, which
>>> allows for two key advantages:
>>>
>>> 1. The amount of memory required for PTEs to map physical pages stays
>>> low even when a large number of threads share the same pages, since
>>> the PTEs are shared across threads.
>>>
>>> 2. Page protection attributes are shared across threads, and a change
>>> of attributes applies immediately to every thread without any overhead
>>> of coordinating protection-bit changes across threads.
>>>
>>> These advantages no longer apply when unrelated processes share pages.
>>> Large database applications can easily comprise thousands of processes
>>> that share hundreds of GB of pages. In cases like this, the amount of
>>> memory consumed by page tables can exceed the size of the actual
>>> shared data. On a database server with a 300GB SGA, a system crash was
>>> seen with an out-of-memory condition when 1500+ clients tried to share
>>> this SGA, even though the system had 512GB of memory. On this server,
>>> the worst-case scenario of all 1500 processes mapping every page of
>>> the SGA would have required 878GB+ for just the PTEs.
>>>
>>> I have sent proposals and patches to solve this problem by adding a
>>> mechanism to the kernel that processes can use to opt into sharing
>>> page tables with other processes. We have had discussions on the
>>> original proposal and subsequent refinements, but we have not
>>> converged on a solution. As systems with multi-TB memory and in-memory
>>> databases become more and more common, this is becoming a significant
>>> issue. An interactive discussion can help us reach a consensus on how
>>> to solve this.
>>
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months ago [1],
>> so one problem of "not converging" might be "no follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed at
>> LSF/MM, but take a step back and discuss requirements and possible
>> adjustments to the original concept to get something possibly cleaner.
>
> I think the concept is clean.
> Your concept doesn't fit our use case!
Which one exactly are you talking about in particular?
I raised various alternatives/modifications for discussion, learning
what works and what doesn't work on the way. (I never understood why
protection on the pagecache level wouldn't work for your use case, but
let's put that aside).
In my last mail, I had the following:
"
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) that pure page table sharing itself might be itself a reasonable
optimization worth having.
I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe give a hint that it
might be reasonable).
As said, I see value in some fd-thingy that can be mmaped, but is
internally assembled from other fds (using protect ioctls, not mmap)
with sub-protection (using protect ioctls, not mprotect). The ioctls
would be minimal and clearly specified. Most madvise()/uffd/... would
simply fail when seeing a VMA that mmaps such a fd thingy. No rerouting
of mmap, munmap, mprotect, ...
Under the hood, one can use a MM to manage all that and share page
tables. But it would be an implementation detail.
"
So I do think the original mshare could be done "less scary" [1] by
exposing a different, well-defined and restricted interface to manage
the "content" of mshare.
There is a lot of detail I have in mind, but it doesn't make sense to
describe it if it won't solve your use case.
In my world it would end up cleaner, and naive me would have thought
that you would enjoy something close to original mshare, just a bit less
scary :)
> So essentially what you're asking for is for us to do a lot of work
> which doesn't solve our problem. You can imagine our lack of enthusiasm
> for this.
I recall that implementing generic page table sharing is a lot of work
and that Oracle isn't interested in doing it; fair enough, I understood
that.
Really, the amount of work is unclear if we don't talk about the actual
solution.
I cannot really do more than offer help like I did:
"I'm happy to discuss further. In a bi-weekly MM meeting, off-list or
here.".
But if my comments are so unreasonable that they are not even worth
discussing, then I likely wouldn't be of any help in another mshare
session.
[1] https://lwn.net/Articles/895217/
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 9:21 ` David Hildenbrand
2024-02-29 14:12 ` Matthew Wilcox
@ 2024-03-04 16:45 ` Khalid Aziz
2024-03-25 17:57 ` David Hildenbrand
1 sibling, 1 reply; 8+ messages in thread
From: Khalid Aziz @ 2024-03-04 16:45 UTC (permalink / raw)
To: David Hildenbrand, lsf-pc; +Cc: linux-mm
On 2/29/24 02:21, David Hildenbrand wrote:
> On 28.02.24 23:56, Khalid Aziz wrote:
>> Threads of a process share an address space and page tables, which
>> allows for two key advantages:
>>
>> 1. The amount of memory required for PTEs to map physical pages stays
>> low even when a large number of threads share the same pages, since
>> the PTEs are shared across threads.
>>
>> 2. Page protection attributes are shared across threads, and a change
>> of attributes applies immediately to every thread without any overhead
>> of coordinating protection-bit changes across threads.
>>
>> These advantages no longer apply when unrelated processes share pages.
>> Large database applications can easily comprise thousands of processes
>> that share hundreds of GB of pages. In cases like this, the amount of
>> memory consumed by page tables can exceed the size of the actual
>> shared data. On a database server with a 300GB SGA, a system crash was
>> seen with an out-of-memory condition when 1500+ clients tried to share
>> this SGA, even though the system had 512GB of memory. On this server,
>> the worst-case scenario of all 1500 processes mapping every page of
>> the SGA would have required 878GB+ for just the PTEs.
>>
>> I have sent proposals and patches to solve this problem by adding a
>> mechanism to the kernel that processes can use to opt into sharing
>> page tables with other processes. We have had discussions on the
>> original proposal and subsequent refinements, but we have not
>> converged on a solution. As systems with multi-TB memory and in-memory
>> databases become more and more common, this is becoming a significant
>> issue. An interactive discussion can help us reach a consensus on how
>> to solve this.
>
> Hi,
>
> I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be
> "no follow-up discussion".
>
> Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss
> requirements and possible adjustments to the original concept to get something possibly cleaner.
>
> For example, I raised some ideas for avoiding having to re-route mprotect()/mmap() calls. At least discussing somewhere
> why they are all bad would be helpful ;)
>
> [1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
>
Hi David,
That is fair. A face-to-face discussion can help resolve these more easily, but I will attempt to address them here, and
maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let those drive
the implementation.
On 11/2/23 14:25, David Hildenbrand wrote:
> On 01.11.23 23:40, Khalid Aziz wrote:
>> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
>> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
>
> ... and everyone has to get the fault and mprotect() again,
>
> Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
>
> You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
>
My understanding of the requirement from database applications is that they want to create a large shared memory region
for 1000s of processes. This region can have file-backed pages or not. One of the processes can be the control process
that serves as gatekeeper to various parts of this shared region. This process can open up write access to a part of the
shared region (which can span thousands of pages), populate/update data, and then close down write access to this
region. Any other process that tries to write to this region at this time can get a signal and choose to handle it or
simply be killed. All the gatekeeper process wants to do is close access to the shared region at any time, without
having to coordinate that with 1000s of processes, and let other processes deal with access having been closed. With
this requirement, what database applications have found to be effective is to use mprotect() to apply protection to a
part of the shared region and then have it propagate to everyone attempting to access that region. Using currently
available mechanisms, that meant sending messages to every process to apply the same mprotect() bits to their own PTEs
and honor the gatekeeper's request. With shared PTEs, opted into explicitly, protection bits for all processes change at
the same time with no additional action required by 1000s of processes. That helps performance very significantly.
The second big win here is the memory saved that would have been used by PTEs in all the processes. The memory saved
this way literally takes a system from being completely infeasible to a system with room to spare (referring to the case
I described in my original mail, where we needed more memory to store PTEs than was installed on the system).
>> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
>> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
>> same time. So the two requirements of this feature are not separable.
>
> Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> the current solution really requires sharing of page tables, which I absolutely don't like.
>
> It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> with weird VMA semantics.
We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the gatekeeper
to close the gates and call it done.
Thanks,
Khalid
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-03-04 16:45 ` Khalid Aziz
@ 2024-03-25 17:57 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-03-25 17:57 UTC (permalink / raw)
To: Khalid Aziz, lsf-pc; +Cc: linux-mm
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be
>> "no follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss
>> requirements and possible adjustments to the original concept to get something possibly cleaner.
>>
>> For example, I raised some ideas for avoiding having to re-route mprotect()/mmap() calls. At least discussing
>> somewhere why they are all bad would be helpful ;)
>>
>> [1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
>>
>
> Hi David,
>
> That is fair. A face-to-face discussion can help resolve these more easily, but I will attempt to address them here,
> and maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let those
> drive the implementation.
Hi Khalid,
sorry for the late reply, my mailbox got a bit flooded.
>
> On 11/2/23 14:25, David Hildenbrand wrote:
> > On 01.11.23 23:40, Khalid Aziz wrote:
> >> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
> >> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
> >
> > ... and everyone has to get the fault and mprotect() again,
> >
> > Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
> >
> > You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> > again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> > infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
> >
>
> My understanding of the requirement from database applications is that they want to create a large shared memory
> region for 1000s of processes. This region can have file-backed pages or not. One of the processes can be the control
> process that serves as gatekeeper to various parts of this shared region. This process can open up write access to a
> part of the shared region (which can span thousands of pages), populate/update data, and then close down write access
> to this region. Any other process that tries to write to this region at this time can get a signal and choose to
> handle it or simply be killed.
Got it.
> All the gatekeeper process wants to do is close access to the shared region at any time, without having to coordinate
> that with 1000s of processes, and let other processes deal with access having been closed. With this requirement, what
> database applications have found to be effective is to use mprotect() to apply protection to a part of the shared
> region and then have it propagate to everyone attempting to access that region. Using currently available mechanisms,
> that meant sending messages to every process to apply the same mprotect() bits to their own PTEs and honor the gatekeeper's
Yes, mprotect() over multiple processes is indeed stupid. It's also the
same thing one currently has to do with uffd-wp: each process has to
protect the pages in its own page tables.
> request. With shared PTEs, opted into explicitly, protection bits for all processes change at the same time with no
> additional action required by 1000s of processes. That helps performance very significantly.
>
> The second big win here is the memory saved that would have been used by PTEs in all the processes. The memory saved
> this way literally takes a system from being completely infeasible to a system with room to spare (referring to the
> case I described in my original mail, where we needed more memory to store PTEs than was installed on the system).
Yes, I understood all that.
>
> >> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
> >> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
> >> same time. So the two requirements of this feature are not separable.
> >
> > Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> > the current solution really requires sharing of page tables, which I absolutely don't like.
> >
> > It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> > And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> > with weird VMA semantics.
>
> We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
> multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the
> gatekeeper to close the gates and call it done.
Thanks for these details!
I'll have a bunch of other questions. Finding some way to discuss them
with you in detail would be great. Will you be at LSF/MM so we can talk
in person? Ideally, we could talk before any LSF/MM session.
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-28 22:56 [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare) Khalid Aziz
2024-02-29 9:21 ` David Hildenbrand
@ 2024-05-14 18:21 ` Christoph Lameter (Ampere)
2024-05-17 21:23 ` Khalid Aziz
1 sibling, 1 reply; 8+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-05-14 18:21 UTC (permalink / raw)
To: Khalid Aziz; +Cc: lsf-pc, linux-mm
> 1. The amount of memory required for PTEs to map physical pages stays
> low even when a large number of threads share the same pages, since
> the PTEs are shared across threads.
>
> 2. Page protection attributes are shared across threads, and a change
> of attributes applies immediately to every thread without any overhead
> of coordinating protection-bit changes across threads.
>
> These advantages no longer apply when unrelated processes share pages.
> Large database applications can easily comprise thousands of processes
> that share hundreds of GB of pages. In cases like this, the amount of
> memory consumed by page tables can exceed the size of the actual
> shared data. On a database server with a 300GB SGA, a system crash was
> seen with an out-of-memory condition when 1500+ clients tried to share
> this SGA, even though the system had 512GB of memory. On this server,
> the worst-case scenario of all 1500 processes mapping every page of
> the SGA would have required 878GB+ for just the PTEs.
OK, then use 1GB pages or larger for a shared mapping of huge pages. I
am not sure why there is a need for sharing page tables here. I just
listened to your talk at LSF/MM and noted some things.
It may be best to follow established shared memory approaches like, for
example, those already implemented in shmem.
If you want actual shared-page-table semantics, then the proper
implementation using shmem would be to add an additional flag. Let's
call it O_SHARED_PAGE_TABLE for now.
Then you would do:
fd = shm_open("/shared_pagetable_segment", O_CREAT | O_RDWR | O_SHARED_PAGE_TABLE, 0666);
The remaining handling is straightforward, and the shmem subsystem
already provides consistent handling of shared memory segments.
What you would have to do is sort out the kernel-internal problems
created by sharing page table sections when using SHM VMAs. But with
that, there are only limited changes required to a special type of VMA
and to the shmem subsystem, so the impact on the kernel overall is
limited, and you are following an established method of managing shared
memory.
I actually need something like shared page tables for another,
in-kernel page table use case as well: defining sections of kernel
virtual memory that are special to particular CPUs or nodes. Some
abstracted functions to manage page tables that share pgd/pud/pmd
entries would be good to have in the kernel, if you don't mind.
But for this use case I'd suggest using gigabyte shmem mappings and
being done with it.
https://lwn.net/Articles/375098/
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-05-14 18:21 ` Christoph Lameter (Ampere)
@ 2024-05-17 21:23 ` Khalid Aziz
0 siblings, 0 replies; 8+ messages in thread
From: Khalid Aziz @ 2024-05-17 21:23 UTC (permalink / raw)
To: Christoph Lameter (Ampere); +Cc: lsf-pc, linux-mm
On 5/14/24 12:21, Christoph Lameter (Ampere) wrote:
>> 1. The amount of memory required for PTEs to map physical pages stays
>> low even when a large number of threads share the same pages, since
>> the PTEs are shared across threads.
>>
>> 2. Page protection attributes are shared across threads, and a change
>> of attributes applies immediately to every thread without any overhead
>> of coordinating protection-bit changes across threads.
>>
>> These advantages no longer apply when unrelated processes share pages.
>> Large database applications can easily comprise thousands of processes
>> that share hundreds of GB of pages. In cases like this, the amount of
>> memory consumed by page tables can exceed the size of the actual
>> shared data. On a database server with a 300GB SGA, a system crash was
>> seen with an out-of-memory condition when 1500+ clients tried to share
>> this SGA, even though the system had 512GB of memory. On this server,
>> the worst-case scenario of all 1500 processes mapping every page of
>> the SGA would have required 878GB+ for just the PTEs.
>
> OK, then use 1GB pages or larger for a shared mapping of huge pages. I am not sure why there is a need for sharing
> page tables here. I just listened to your talk at LSF/MM and noted some things.
>
> It may be best to follow established shared memory approaches like, for example, those already implemented in shmem.
>
> If you want actual shared-page-table semantics, then the proper implementation using shmem would be to add an
> additional flag. Let's call it O_SHARED_PAGE_TABLE for now.
>
> Then you would do:
>
> fd = shm_open("/shared_pagetable_segment", O_CREAT | O_RDWR | O_SHARED_PAGE_TABLE, 0666);
>
> The remaining handling is straightforward, and the shmem subsystem already provides consistent handling of shared
> memory segments.
>
> What you would have to do is sort out the kernel-internal problems created by sharing page table sections when using
> SHM VMAs. But with that, there are only limited changes required to a special type of VMA and to the shmem subsystem,
> so the impact on the kernel overall is limited, and you are following an established method of managing shared memory.
>
> I actually need something like shared page tables for another, in-kernel page table use case as well: defining
> sections of kernel virtual memory that are special to particular CPUs or nodes. Some abstracted functions to manage
> page tables that share pgd/pud/pmd entries would be good to have in the kernel, if you don't mind.
>
> But for this use case I'd suggest using gigabyte shmem mappings and being done with it.
>
> https://lwn.net/Articles/375098/
Hello Christoph,
Thanks for the feedback. Yes, shmem can address this specific case, and a solution using shmem with hugepages is in use
currently. There are two issues with that: (1) it addresses only this specific problem and does not address page table
sharing in the general case, which, from what I hear from many other people, is indeed needed; (2) hugepages have to be
pre-allocated, which is not a flexible solution. Even though hugepages can be added at any time, the kernel does it on a
best-effort basis, and the latency to get the required number of hugepages can be unpredictable. So a more general
solution that does not depend on hugepages can be more useful in the long run, and it can help other cases as well, like
yours.
Thanks,
Khalid