* [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
@ 2024-02-28 22:56 Khalid Aziz
2024-02-29 9:21 ` David Hildenbrand
2024-05-14 18:21 ` Christoph Lameter (Ampere)
0 siblings, 2 replies; 8+ messages in thread
From: Khalid Aziz @ 2024-02-28 22:56 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm
Threads of a process share an address space and page tables, which
allows for two key advantages:

1. The amount of memory required for PTEs to map physical pages stays
low even when a large number of threads share the same pages, since the
PTEs are shared across threads.

2. Page protection attributes are shared across threads, and a change
of attributes applies immediately to every thread without any overhead
of coordinating protection-bit changes across threads.

These advantages no longer apply when unrelated processes share pages.
Large database applications can easily comprise thousands of processes
that share hundreds of GB of pages. In cases like this, the amount of
memory consumed by page tables can exceed the size of the actual shared
data. On a database server with a 300GB SGA, a system crash was seen
with an out-of-memory condition when 1500+ clients tried to share this
SGA, even though the system had 512GB of memory. On this server, the
worst-case scenario of all 1500 processes mapping every page of the SGA
would have required 878GB+ for just the PTEs.
I have sent proposals and patches to solve this problem by adding a
mechanism to the kernel that processes can use to opt into sharing
page tables with other processes. We have had discussions on the
original proposal and subsequent refinements, but we have not converged
on a solution. As systems with multi-TB memory and in-memory databases
become more and more common, this is becoming a significant issue. An
interactive discussion can help us reach a consensus on how to solve
this.
Thanks,
Khalid
References:
https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/
https://lore.kernel.org/lkml/4082bc40-a99a-4b54-91e5-a1b55828d202@oracle.com/
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-28 22:56 [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare) Khalid Aziz
@ 2024-02-29 9:21 ` David Hildenbrand
2024-02-29 14:12 ` Matthew Wilcox
2024-03-04 16:45 ` Khalid Aziz
2024-05-14 18:21 ` Christoph Lameter (Ampere)
1 sibling, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-02-29 9:21 UTC (permalink / raw)
To: Khalid Aziz, lsf-pc; +Cc: linux-mm
On 28.02.24 23:56, Khalid Aziz wrote:
> Threads of a process share an address space and page tables, which
> allows for two key advantages:
>
> 1. The amount of memory required for PTEs to map physical pages stays
> low even when a large number of threads share the same pages, since
> the PTEs are shared across threads.
>
> 2. Page protection attributes are shared across threads, and a change
> of attributes applies immediately to every thread without any overhead
> of coordinating protection-bit changes across threads.
>
> These advantages no longer apply when unrelated processes share pages.
> Large database applications can easily comprise thousands of processes
> that share hundreds of GB of pages. In cases like this, the amount of
> memory consumed by page tables can exceed the size of the actual
> shared data. On a database server with a 300GB SGA, a system crash was
> seen with an out-of-memory condition when 1500+ clients tried to share
> this SGA, even though the system had 512GB of memory. On this server,
> the worst-case scenario of all 1500 processes mapping every page of
> the SGA would have required 878GB+ for just the PTEs.
>
> I have sent proposals and patches to solve this problem by adding a
> mechanism to the kernel that processes can use to opt into sharing
> page tables with other processes. We have had discussions on the
> original proposal and subsequent refinements, but we have not
> converged on a solution. As systems with multi-TB memory and in-memory
> databases become more and more common, this is becoming a significant
> issue. An interactive discussion can help us reach a consensus on how
> to solve this.
Hi,
I was hoping for a follow-up to my previous comments from ~4 months ago
[1], so one problem of "not converging" might be "no follow-up discussion".
Ideally, this session would not focus on mshare as previously discussed
at LSF/MM, but take a step back and discuss requirements and possible
adjustments to the original concept to get something possibly cleaner.
For example, I raised some ideas for avoiding having to re-route
mprotect()/mmap() calls. At least discussing somewhere why they are all
bad would be helpful ;)
[1]
https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 9:21 ` David Hildenbrand
@ 2024-02-29 14:12 ` Matthew Wilcox
2024-02-29 15:15 ` David Hildenbrand
2024-03-04 16:45 ` Khalid Aziz
1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2024-02-29 14:12 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Khalid Aziz, lsf-pc, linux-mm
On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
> On 28.02.24 23:56, Khalid Aziz wrote:
> > Threads of a process share an address space and page tables, which
> > allows for two key advantages:
> >
> > 1. The amount of memory required for PTEs to map physical pages stays
> > low even when a large number of threads share the same pages, since
> > the PTEs are shared across threads.
> >
> > 2. Page protection attributes are shared across threads, and a change
> > of attributes applies immediately to every thread without any overhead
> > of coordinating protection-bit changes across threads.
> >
> > These advantages no longer apply when unrelated processes share pages.
> > Large database applications can easily comprise thousands of processes
> > that share hundreds of GB of pages. In cases like this, the amount of
> > memory consumed by page tables can exceed the size of the actual
> > shared data. On a database server with a 300GB SGA, a system crash was
> > seen with an out-of-memory condition when 1500+ clients tried to share
> > this SGA, even though the system had 512GB of memory. On this server,
> > the worst-case scenario of all 1500 processes mapping every page of
> > the SGA would have required 878GB+ for just the PTEs.
> >
> > I have sent proposals and patches to solve this problem by adding a
> > mechanism to the kernel that processes can use to opt into sharing
> > page tables with other processes. We have had discussions on the
> > original proposal and subsequent refinements, but we have not
> > converged on a solution. As systems with multi-TB memory and in-memory
> > databases become more and more common, this is becoming a significant
> > issue. An interactive discussion can help us reach a consensus on how
> > to solve this.
>
> Hi,
>
> I was hoping for a follow-up to my previous comments from ~4 months ago [1],
> so one problem of "not converging" might be "no follow-up discussion".
>
> Ideally, this session would not focus on mshare as previously discussed at
> LSF/MM, but take a step back and discuss requirements and possible
> adjustments to the original concept to get something possibly cleaner.
I think the concept is clean. Your concept doesn't fit our use case!
So essentially what you're asking for is for us to do a lot of work
which doesn't solve our problem. You can imagine our lack of enthusiasm
for this.
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 14:12 ` Matthew Wilcox
@ 2024-02-29 15:15 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-02-29 15:15 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Khalid Aziz, lsf-pc, linux-mm
On 29.02.24 15:12, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
>> On 28.02.24 23:56, Khalid Aziz wrote:
>>> Threads of a process share an address space and page tables, which
>>> allows for two key advantages:
>>>
>>> 1. The amount of memory required for PTEs to map physical pages stays
>>> low even when a large number of threads share the same pages, since
>>> the PTEs are shared across threads.
>>>
>>> 2. Page protection attributes are shared across threads, and a change
>>> of attributes applies immediately to every thread without any overhead
>>> of coordinating protection-bit changes across threads.
>>>
>>> These advantages no longer apply when unrelated processes share pages.
>>> Large database applications can easily comprise thousands of processes
>>> that share hundreds of GB of pages. In cases like this, the amount of
>>> memory consumed by page tables can exceed the size of the actual
>>> shared data. On a database server with a 300GB SGA, a system crash was
>>> seen with an out-of-memory condition when 1500+ clients tried to share
>>> this SGA, even though the system had 512GB of memory. On this server,
>>> the worst-case scenario of all 1500 processes mapping every page of
>>> the SGA would have required 878GB+ for just the PTEs.
>>>
>>> I have sent proposals and patches to solve this problem by adding a
>>> mechanism to the kernel that processes can use to opt into sharing
>>> page tables with other processes. We have had discussions on the
>>> original proposal and subsequent refinements, but we have not
>>> converged on a solution. As systems with multi-TB memory and in-memory
>>> databases become more and more common, this is becoming a significant
>>> issue. An interactive discussion can help us reach a consensus on how
>>> to solve this.
>>
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months ago [1],
>> so one problem of "not converging" might be "no follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed at
>> LSF/MM, but take a step back and discuss requirements and possible
>> adjustments to the original concept to get something possibly cleaner.
>
> I think the concept is clean.
> Your concept doesn't fit our use case!
Which one exactly are you talking about in particular?
I raised various alternatives/modifications for discussion, learning
what works and what doesn't work on the way. (I never understood why
protection on the pagecache level wouldn't work for your use case, but
let's put that aside).
In my last mail, I had the following:
"
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) that pure page table sharing itself might be itself a reasonable
optimization worth having.
I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe give a hint that it
might be reasonable).
As said, I see value in some fd-thingy that can be mmaped, but is
internally assembled from other fds (using protect ioctls, not mmap)
with sub-protection (using protect ioctls, not mprotect). The ioctls
would be minimal and clearly specified. Most madvise()/uffd/... would
simply fail when seeing a VMA that mmaps such a fd thingy. No rerouting
of mmap, munmap, mprotect, ...
Under the hood, one can use a MM to manage all that and share page
tables. But it would be an implementation detail.
"
So I do think the original mshare could be done "less scary" [1] by
exposing a different, well-defined and restricted interface to manage
the "content" of mshare.
There is a lot of detail I have in mind, but it doesn't make sense to
describe it if it won't solve your use case.
In my world it would end up cleaner, and naive me would have thought
that you would enjoy something close to original mshare, just a bit less
scary :)
> So essentially what you're asking for is for us to do a lot of work
> which doesn't solve our problem. You can imagine our lack of enthusiasm
> for this.
I recall that implementing generic page table sharing is a lot of work
and that Oracle isn't interested in doing it; fair enough, I understood
that.
Really, the amount of work is unclear if we don't talk about the actual
solution.
I cannot really do more than offer help like I did:
"I'm happy to discuss further. In a bi-weekly MM meeting, off-list or
here.".
But if my comments are so unreasonable that they are not even worth
discussing, then I likely wouldn't be of any help in another mshare
session.
[1] https://lwn.net/Articles/895217/
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-29 9:21 ` David Hildenbrand
2024-02-29 14:12 ` Matthew Wilcox
@ 2024-03-04 16:45 ` Khalid Aziz
2024-03-25 17:57 ` David Hildenbrand
1 sibling, 1 reply; 8+ messages in thread
From: Khalid Aziz @ 2024-03-04 16:45 UTC (permalink / raw)
To: David Hildenbrand, lsf-pc; +Cc: linux-mm
On 2/29/24 02:21, David Hildenbrand wrote:
> On 28.02.24 23:56, Khalid Aziz wrote:
>> Threads of a process share an address space and page tables, which
>> allows for two key advantages:
>>
>> 1. The amount of memory required for PTEs to map physical pages stays
>> low even when a large number of threads share the same pages, since
>> the PTEs are shared across threads.
>>
>> 2. Page protection attributes are shared across threads, and a change
>> of attributes applies immediately to every thread without any overhead
>> of coordinating protection-bit changes across threads.
>>
>> These advantages no longer apply when unrelated processes share pages.
>> Large database applications can easily comprise thousands of processes
>> that share hundreds of GB of pages. In cases like this, the amount of
>> memory consumed by page tables can exceed the size of the actual
>> shared data. On a database server with a 300GB SGA, a system crash was
>> seen with an out-of-memory condition when 1500+ clients tried to share
>> this SGA, even though the system had 512GB of memory. On this server,
>> the worst-case scenario of all 1500 processes mapping every page of
>> the SGA would have required 878GB+ for just the PTEs.
>>
>> I have sent proposals and patches to solve this problem by adding a
>> mechanism to the kernel that processes can use to opt into sharing
>> page tables with other processes. We have had discussions on the
>> original proposal and subsequent refinements, but we have not
>> converged on a solution. As systems with multi-TB memory and in-memory
>> databases become more and more common, this is becoming a significant
>> issue. An interactive discussion can help us reach a consensus on how
>> to solve this.
>
> Hi,
>
> I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be
> "no follow-up discussion".
>
> Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss
> requirements and possible adjustments to the original concept to get something possibly cleaner.
>
> For example, I raised some ideas for avoiding having to re-route mprotect()/mmap() calls. At least discussing somewhere
> why they are all bad would be helpful ;)
>
> [1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
>
Hi David,
That is fair. A face-to-face discussion can help resolve these more easily, but I will attempt to address them here, and
maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let those drive
the implementation.
On 11/2/23 14:25, David Hildenbrand wrote:
> On 01.11.23 23:40, Khalid Aziz wrote:
>> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
>> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
>
> ... and everyone has to get the fault and mprotect() again,
>
> Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
>
> You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
>
My understanding of the requirement from database applications is that they want to create a large shared memory region
for 1000s of processes. This region can have file-backed pages or not. One of the processes can be the control process
that serves as gatekeeper to various parts of this shared region. This process can open up write access to a part of the
shared region (which can span thousands of pages), populate/update data, and then close down write access to this
region. Any other process that tries to write to this region at this time can get a signal and choose to handle it or
simply be killed. All the gatekeeper process wants to do is close access to the shared region at any time, without
having to coordinate that with 1000s of processes, and let other processes deal with access having been closed. With
this requirement, what database applications have found to be effective is to use mprotect() to apply protection to a
part of the shared region and then have it propagate to everyone attempting to access that region. Using currently
available mechanisms, that meant sending messages to every process to apply the same mprotect() bits to their own PTEs
and honor the gatekeeper's request. With shared PTEs, opted into explicitly, protection bits for all processes change at
the same time with no additional action required by 1000s of processes. That helps performance very significantly.
The second big win here is the memory saved that would have been used by PTEs in all the processes. The memory saved
this way literally takes a system from being completely infeasible to a system with room to spare (referring to the case
I described in my original mail, where we needed more memory to store PTEs than was installed on the system).
>> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
>> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
>> same time. So the two requirements of this feature are not separable.
>
> Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> the current solution really requires sharing of page tables, which I absolutely don't like.
>
> It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> with weird VMA semantics.
We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the gatekeeper
to close the gates and call it done.
Thanks,
Khalid
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-03-04 16:45 ` Khalid Aziz
@ 2024-03-25 17:57 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2024-03-25 17:57 UTC (permalink / raw)
To: Khalid Aziz, lsf-pc; +Cc: linux-mm
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months ago [1], so one problem of "not converging" might be
>> "no follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed at LSF/MM, but take a step back and discuss
>> requirements and possible adjustments to the original concept to get something possibly cleaner.
>>
>> For example, I raised some ideas for avoiding having to re-route mprotect()/mmap() calls. At least discussing
>> somewhere why they are all bad would be helpful ;)
>>
>> [1] https://lore.kernel.org/lkml/927b6339-ac5f-480c-9cdc-49c838cbef20@redhat.com/
>>
>
> Hi David,
>
> That is fair. A face-to-face discussion can help resolve these more easily, but I will attempt to address them here,
> and maybe we can come to a better understanding of the requirements. I do want to focus on requirements and let those
> drive the implementation.
Hi Khalid,
sorry for the late reply, my mailbox got a bit flooded.
>
> On 11/2/23 14:25, David Hildenbrand wrote:
> > On 01.11.23 23:40, Khalid Aziz wrote:
> >> is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page
> >> protection is changed impacts every process. By sharing same PTE across all processes, any page protection changes apply
> >
> > ... and everyone has to get the fault and mprotect() again,
> >
> > Which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.
> >
> > You want to protect a pagecache page from write access, catch write access and handle it, to then allow write-access
> > again without successive fault->signal. Something similar is being done by filesystems already with the writenotify
> > infrastructure I believe. You just don't get a signal on write access, because it's all handled internally in the FS.
> >
>
> My understanding of the requirement from database applications is that they want to create a large shared memory
> region for 1000s of processes. This region can have file-backed pages or not. One of the processes can be the control
> process that serves as gatekeeper to various parts of this shared region. This process can open up write access to a
> part of the shared region (which can span thousands of pages), populate/update data, and then close down write access
> to this region. Any other process that tries to write to this region at this time can get a signal and choose to
> handle it or simply be killed.
Got it.
> All the gatekeeper process wants to do is close access to the shared region at any time, without having to coordinate
> that with 1000s of processes, and let other processes deal with access having been closed. With this requirement, what
> database applications have found to be effective is to use mprotect() to apply protection to a part of the shared
> region and then have it propagate to everyone attempting to access that region. Using currently available mechanisms,
> that meant sending messages to every process to apply the same mprotect() bits to their own PTEs and honor the gatekeeper's
Yes, mprotect() over multiple processes is indeed stupid. It's also the
same thing one currently has to do with uffd-wp: each process has to
protect the pages in its own page tables.
> request. With shared PTEs, opted into explicitly, protection bits for all processes change at the same time with no
> additional action required by 1000s of processes. That helps performance very significantly.
>
> The second big win here is the memory saved that would have been used by PTEs in all the processes. The memory saved
> this way literally takes a system from being completely infeasible to a system with room to spare (referring to the
> case I described in my original mail, where we needed more memory to store PTEs than was installed on the system).
Yes, I understood all that.
>
> >> instantly to all processes (there is the TLB shootdown issue but as discussed in the meeting, it can be handled). The
> >> mshare proposal implements the instant page protection change while bringing in benefits of shared page tables at the
> >> same time. So the two requirements of this feature are not separable.
> >
> > Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because
> > the current solution really requires sharing of page tables, which I absolutely don't like.
> >
> > It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page.
> > And because we still decide to do so, we have to come up with ways of making page table sharing a user-visible feature
> > with weird VMA semantics.
>
> We are not trying to catch write access to a pagecache page here. We simply want to prevent write access to a large,
> multi-page memory region by all processes sharing it, and to do it instantly and efficiently by allowing the
> gatekeeper to close the gates and call it done.
Thanks for these details!
I'll have a bunch of other questions. Finding some way to discuss them
with you in detail would be great. Will you be at LSF/MM so we can talk
in person? Ideally, we could talk before any LSF/MM session.
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-02-28 22:56 [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare) Khalid Aziz
2024-02-29 9:21 ` David Hildenbrand
@ 2024-05-14 18:21 ` Christoph Lameter (Ampere)
2024-05-17 21:23 ` Khalid Aziz
1 sibling, 1 reply; 8+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-05-14 18:21 UTC (permalink / raw)
To: Khalid Aziz; +Cc: lsf-pc, linux-mm
> 1. The amount of memory required for PTEs to map physical pages stays
> low even when a large number of threads share the same pages, since
> the PTEs are shared across threads.
>
> 2. Page protection attributes are shared across threads, and a change
> of attributes applies immediately to every thread without any overhead
> of coordinating protection-bit changes across threads.
>
> These advantages no longer apply when unrelated processes share pages.
> Large database applications can easily comprise thousands of processes
> that share hundreds of GB of pages. In cases like this, the amount of
> memory consumed by page tables can exceed the size of the actual
> shared data. On a database server with a 300GB SGA, a system crash was
> seen with an out-of-memory condition when 1500+ clients tried to share
> this SGA, even though the system had 512GB of memory. On this server,
> the worst-case scenario of all 1500 processes mapping every page of
> the SGA would have required 878GB+ for just the PTEs.
OK, then use 1GB pages or larger for a shared mapping of huge pages. I
am not sure why there is a need for sharing page tables here. I just
listened to your talk at LSF/MM and noted some things.
It may be best to follow established shared memory approaches like, for
example, those already implemented in shmem.
If you want actual shared-page-table semantics, then the proper
implementation using shmem would be to add an additional flag. Let's
call it O_SHARED_PAGE_TABLE for now.
Then you would do:
fd = shm_open("/shared_pagetable_segment", O_CREAT | O_RDWR | O_SHARED_PAGE_TABLE, 0666);
The remaining handling is straightforward, and the shmem subsystem
already provides consistent handling of shared memory segments.
What you would have to do is sort out the kernel-internal problems
created by sharing page table sections when using SHM VMAs. But with
that, there are only limited changes required to a special type of VMA
and to the shmem subsystem, so the impact on the kernel overall is
limited, and you are following an established method of managing shared
memory.
I actually need something like shared page tables for another,
in-kernel page table use case as well: defining sections of kernel
virtual memory that are special to particular CPUs or nodes. Some
abstracted functions to manage page tables that share pgd/pud/pmd
entries would be good to have in the kernel, if you don't mind.
But for this use case I'd suggest using gigabyte shmem mappings and
being done with it.
https://lwn.net/Articles/375098/
* Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
2024-05-14 18:21 ` Christoph Lameter (Ampere)
@ 2024-05-17 21:23 ` Khalid Aziz
0 siblings, 0 replies; 8+ messages in thread
From: Khalid Aziz @ 2024-05-17 21:23 UTC (permalink / raw)
To: Christoph Lameter (Ampere); +Cc: lsf-pc, linux-mm
On 5/14/24 12:21, Christoph Lameter (Ampere) wrote:
>> 1. The amount of memory required for PTEs to map physical pages stays
>> low even when a large number of threads share the same pages, since
>> the PTEs are shared across threads.
>>
>> 2. Page protection attributes are shared across threads, and a change
>> of attributes applies immediately to every thread without any overhead
>> of coordinating protection-bit changes across threads.
>>
>> These advantages no longer apply when unrelated processes share pages.
>> Large database applications can easily comprise thousands of processes
>> that share hundreds of GB of pages. In cases like this, the amount of
>> memory consumed by page tables can exceed the size of the actual
>> shared data. On a database server with a 300GB SGA, a system crash was
>> seen with an out-of-memory condition when 1500+ clients tried to share
>> this SGA, even though the system had 512GB of memory. On this server,
>> the worst-case scenario of all 1500 processes mapping every page of
>> the SGA would have required 878GB+ for just the PTEs.
>
> OK, then use 1GB pages or larger for a shared mapping of huge pages. I am not sure why there is a need for sharing
> page tables here. I just listened to your talk at LSF/MM and noted some things.
>
> It may be best to follow established shared memory approaches like, for example, those already implemented in shmem.
>
> If you want actual shared-page-table semantics, then the proper implementation using shmem would be to add an
> additional flag. Let's call it O_SHARED_PAGE_TABLE for now.
>
> Then you would do:
>
> fd = shm_open("/shared_pagetable_segment", O_CREAT | O_RDWR | O_SHARED_PAGE_TABLE, 0666);
>
> The remaining handling is straightforward, and the shmem subsystem already provides consistent handling of shared
> memory segments.
>
> What you would have to do is sort out the kernel-internal problems created by sharing page table sections when using
> SHM VMAs. But with that, there are only limited changes required to a special type of VMA and to the shmem subsystem,
> so the impact on the kernel overall is limited, and you are following an established method of managing shared memory.
>
> I actually need something like shared page tables for another, in-kernel page table use case as well: defining
> sections of kernel virtual memory that are special to particular CPUs or nodes. Some abstracted functions to manage
> page tables that share pgd/pud/pmd entries would be good to have in the kernel, if you don't mind.
>
> But for this use case I'd suggest using gigabyte shmem mappings and being done with it.
>
> https://lwn.net/Articles/375098/
Hello Christoph,
Thanks for the feedback. Yes, shmem can address this specific case, and a solution using shmem with hugepages is in use
currently. There are two issues with that: (1) it addresses only this specific problem and does not address page table
sharing in the general case, which, from what I hear from many other people, is indeed needed; (2) hugepages have to be
pre-allocated, which is not a flexible solution. Even though hugepages can be added at any time, the kernel does it on a
best-effort basis, and the latency to get the required number of hugepages can be unpredictable. So a more general
solution that does not depend on hugepages can be more useful in the long run, and it can help other cases as well, like
yours.
Thanks,
Khalid