* scalability regressions related to hugetlb_fault() changes
@ 2022-03-24 20:12 Ray Fucillo
  2022-03-24 21:55 ` Randy Dunlap
  2022-03-30 12:22 ` Thorsten Leemhuis
  0 siblings, 2 replies; 8+ messages in thread
From: Ray Fucillo @ 2022-03-24 20:12 UTC (permalink / raw)
  To: linux-kernel

In moving to newer versions of the kernel, our customers have experienced dramatic new scalability problems in our database application, InterSystems IRIS.  Our research has narrowed this down to new processes that attach to the database's shared memory segment incurring very long delays (in some cases ~100ms!) while acquiring the i_mmap_lock_read() in hugetlb_fault() as they fault in the huge page for the first time.  The addition of this lock in hugetlb_fault() matches the versions where we see this problem.  The delay doesn't just slow the new process that incurs it; it also backs up other processes if the page fault occurs inside a critical section within the database application.

Is there something that can be improved here?  

The read locks in hugetlb_fault() contend with write locks that seem to be taken in very common application code paths: shmat(), process exit, fork() (not vfork()), shmdt(), presumably others.  So hugetlb_fault() contending to read turns out to be common.  When the system is loaded, there will be many new processes faulting in pages that may block the write lock, which in turn blocks more readers faulting behind it, and so on...  I don't think there's any support for shared page tables in hugetlb to avoid the faults altogether.
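
To illustrate the pattern as we understand it (a simplified sketch of the locking, not the exact kernel source): the fault path takes the mapping's i_mmap_rwsem for read, while the teardown paths take it for write, so one queued writer ends up stalling every reader that arrives after it:

/* fault path, run by each process touching a huge page for the first time */
hugetlb_fault(...)
{
        struct address_space *mapping = vma->vm_file->f_mapping;

        i_mmap_lock_read(mapping);      /* queues behind any pending writer */
        /* ... find or allocate the huge page and set up the pte ... */
        i_mmap_unlock_read(mapping);
}

/* teardown paths (shmdt(), process exit, truncate, ...) */
        i_mmap_lock_write(mapping);     /* waits for all in-flight readers */
        /* ... unmap / tear down page tables for this mapping ... */
        i_mmap_unlock_write(mapping);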

Switching to 1GB huge pages instead of 2MB is a good mitigation in that it reduces the frequency of faults, but it's not a complete solution.
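
(For reference, 1GB pages generally have to be reserved at boot, e.g. with something like "default_hugepagesz=1G hugepagesz=1G hugepages=<N>" on the kernel command line, with <N> sized to the segment.)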

Thanks for considering.

Ray


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-24 20:12 scalability regressions related to hugetlb_fault() changes Ray Fucillo
@ 2022-03-24 21:55 ` Randy Dunlap
  2022-03-24 22:41   ` Mike Kravetz
  2022-03-30 12:22 ` Thorsten Leemhuis
  1 sibling, 1 reply; 8+ messages in thread
From: Randy Dunlap @ 2022-03-24 21:55 UTC (permalink / raw)
  To: Ray Fucillo, linux-kernel, linux-mm

[add linux-mm mailing list]

On 3/24/22 13:12, Ray Fucillo wrote:
> In moving to newer versions of the kernel, our customers have experienced dramatic new scalability problems in our database application, InterSystems IRIS.  Our research has narrowed this down to new processes that attach to the database's shared memory segment incurring very long delays (in some cases ~100ms!) while acquiring the i_mmap_lock_read() in hugetlb_fault() as they fault in the huge page for the first time.  The addition of this lock in hugetlb_fault() matches the versions where we see this problem.  The delay doesn't just slow the new process that incurs it; it also backs up other processes if the page fault occurs inside a critical section within the database application.
> 
> Is there something that can be improved here?  
> 
> The read locks in hugetlb_fault() contend with write locks that seem to be taken in very common application code paths: shmat(), process exit, fork() (not vfork()), shmdt(), presumably others.  So hugetlb_fault() contending to read turns out to be common.  When the system is loaded, there will be many new processes faulting in pages that may block the write lock, which in turn blocks more readers faulting behind it, and so on...  I don't think there's any support for shared page tables in hugetlb to avoid the faults altogether.
> 
> Switching to 1GB huge pages instead of 2MB is a good mitigation in that it reduces the frequency of faults, but it's not a complete solution.
> 
> Thanks for considering.
> 
> Ray

-- 
~Randy


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-24 21:55 ` Randy Dunlap
@ 2022-03-24 22:41   ` Mike Kravetz
  2022-03-25  0:02     ` Ray Fucillo
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2022-03-24 22:41 UTC (permalink / raw)
  To: Ray Fucillo, linux-kernel, linux-mm

On 3/24/22 14:55, Randy Dunlap wrote:
> [add linux-mm mailing list]
> 
> On 3/24/22 13:12, Ray Fucillo wrote:
>> In moving to newer versions of the kernel, our customers have experienced dramatic new scalability problems in our database application, InterSystems IRIS.  Our research has narrowed this down to new processes that attach to the database's shared memory segment incurring very long delays (in some cases ~100ms!) while acquiring the i_mmap_lock_read() in hugetlb_fault() as they fault in the huge page for the first time.  The addition of this lock in hugetlb_fault() matches the versions where we see this problem.  The delay doesn't just slow the new process that incurs it; it also backs up other processes if the page fault occurs inside a critical section within the database application.
>>
>> Is there something that can be improved here?  
>>
>> The read locks in hugetlb_fault() contend with write locks that seem to be taken in very common application code paths: shmat(), process exit, fork() (not vfork()), shmdt(), presumably others.  So hugetlb_fault() contending to read turns out to be common.  When the system is loaded, there will be many new processes faulting in pages that may block the write lock, which in turn blocks more readers faulting behind it, and so on...  I don't think there's any support for shared page tables in hugetlb to avoid the faults altogether.
>>
>> Switching to 1GB huge pages instead of 2MB is a good mitigation in that it reduces the frequency of faults, but it's not a complete solution.
>>
>> Thanks for considering.
>>
>> Ray

Hi Ray,

Acquiring i_mmap_rwsem in hugetlb_fault was added in the v5.7 kernel with
commit c0d0381ade79 "hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization".  Ironically, this was added due to correctness (possible
data corruption) issues with huge pmd sharing (shared page tables for hugetlb
at the pmd level).  It is used to synchronize the fault path which sets up
the sharing with the unmap (or other) path which tears down the sharing.

As mentioned in the commit message, it is 'possible' to approach this issue
in different ways such as catch races, cleanup, backout and retry.  Adding
the synchronization seemed to be the most direct and least error-prone
approach.  I also seem to remember thinking about the possibility of
avoiding the synchronization if pmd sharing was not possible.  That may be
a relatively easy way to speed things up.  Not sure if pmd sharing comes
into play in your customer environments; my guess would be yes (shared
mapping ranges more than 1GB in size and aligned to 1GB).
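
(Very roughly, the eligibility test looks like the following -- a simplified sketch with a made-up function name, not the actual source: the mapping must be shared and the vma must cover a whole PUD_SIZE-aligned range.)

/* simplified sketch, name made up: can this address use a shared pmd? */
static bool pmd_sharing_possible(struct vm_area_struct *vma, unsigned long addr)
{
        unsigned long base = addr & PUD_MASK;   /* 1GB-aligned on x86_64 */
        unsigned long end = base + PUD_SIZE;

        return (vma->vm_flags & VM_MAYSHARE) &&
                vma->vm_start <= base && end <= vma->vm_end;
}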

It has been a couple years since c0d0381ade79, I will take some time to
look into alternatives and/or improvements.

Also, do you have any specifics about the regressions your customers are
seeing?  Specifically, which paths are holding i_mmap_rwsem in write mode
for long periods of time?  I would expect something related to unmap.
Truncation can have long hold times, especially if there are many shared
mappings.  Always worth checking specifics, but more likely this is a general
issue.
-- 
Mike Kravetz


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-24 22:41   ` Mike Kravetz
@ 2022-03-25  0:02     ` Ray Fucillo
  2022-03-25  4:40       ` Mike Kravetz
  0 siblings, 1 reply; 8+ messages in thread
From: Ray Fucillo @ 2022-03-25  0:02 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Ray Fucillo, linux-kernel, linux-mm


> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> I also seem to remember thinking about the possibility of
> avoiding the synchronization if pmd sharing was not possible.  That may be
> a relatively easy way to speed things up.  Not sure if pmd sharing comes
> into play in your customer environments; my guess would be yes (shared
> mapping ranges more than 1GB in size and aligned to 1GB).

Hi Mike, 

This is one very large shared memory segment allocated at database startup.  It's common for it to be hundreds of GB.  We allocate it with shmget() passing SHM_HUGETLB (when huge pages have been reserved for us).  Not sure if that answers...
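
For reference, the allocation is the usual SysV shared memory pattern, roughly like this (key, size and permissions are placeholders, not our actual values):

#include <sys/ipc.h>
#include <sys/shm.h>

/* size must be a multiple of the huge page size; 200GB shown as an example */
int shmid = shmget(IPC_PRIVATE, 200UL << 30,
                   IPC_CREAT | SHM_HUGETLB | 0600);
void *base = shmat(shmid, NULL, 0);   /* each database process attaches */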

> Also, do you have any specifics about the regressions your customers are
> seeing?  Specifically, which paths are holding i_mmap_rwsem in write mode
> for long periods of time?  I would expect something related to unmap.
> Truncation can have long hold times, especially if there are many shared
> mappings.  Always worth checking specifics, but more likely this is a general
> issue.

We've seen the write lock originate from calling shmat(), shmdt() and process exit.  We've also seen it from a fork() off of one of the processes that are attached to the shared memory segment.  Some evidence suggests that fork is a more costly case.  However, while there are some important places where we'd use fork(), it's more unusual because most process creation will vfork() and execv() a new database process (which then attaches with shmat()).


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-25  0:02     ` Ray Fucillo
@ 2022-03-25  4:40       ` Mike Kravetz
  2022-03-25 13:33         ` Ray Fucillo
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2022-03-25  4:40 UTC (permalink / raw)
  To: Ray Fucillo; +Cc: linux-kernel, linux-mm

On 3/24/22 17:02, Ray Fucillo wrote:
> 
>> On Mar 24, 2022, at 6:41 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I also seem to remember thinking about the possibility of
>> avoiding the synchronization if pmd sharing was not possible.  That may be
>> a relatively easy way to speed things up.  Not sure if pmd sharing comes
>> into play in your customer environments; my guess would be yes (shared
>> mapping ranges more than 1GB in size and aligned to 1GB).
> 
> Hi Mike, 
> 
> This is one very large shared memory segment allocated at database startup.  It's common for it to be hundreds of GB.  We allocate it with shmget() passing SHM_HUGETLB (when huge pages have been reserved for us).  Not sure if that answers...

Yes, so there would be shared pmds for that large shared mapping.  I assume
this is x86 or arm64, which are the only architectures that support shared
pmds.

So, the easy change of "don't take semaphore if pmd sharing is not possible"
would not apply.

>> Also, do you have any specifics about the regressions your customers are
>> seeing?  Specifically, which paths are holding i_mmap_rwsem in write mode
>> for long periods of time?  I would expect something related to unmap.
>> Truncation can have long hold times, especially if there are many shared
>> mappings.  Always worth checking specifics, but more likely this is a general
>> issue.
> 
> We've seen the write lock originate from calling shmat(), shmdt() and process exit.  We've also seen it from a fork() off of one of the processes that are attached to the shared memory segment.  Some evidence suggests that fork is a more costly case.  However, while there are some important places where we'd use fork(), it's more unusual because most process creation will vfork() and execv() a new database process (which then attaches with shmat()).

Thanks.

I will continue to look at this.  A quick check of the fork code shows the
semaphore held in read mode for the duration of the page table copy.
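
(Roughly, for a shared hugetlb vma the copy looks like the following -- a simplified sketch, not the actual source, but the lock hold pattern is the point.)

/* simplified sketch of fork's hugetlb copy for a shared mapping */
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                            struct vm_area_struct *vma)
{
        struct address_space *mapping = vma->vm_file->f_mapping;
        unsigned long addr, sz = huge_page_size(hstate_vma(vma));

        i_mmap_lock_read(mapping);              /* held for the entire copy */
        for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
                /* ... allocate the dst pte (possibly sharing a pmd) and
                 * copy the src pte for this address ... */
        }
        i_mmap_unlock_read(mapping);
        return 0;
}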
-- 
Mike Kravetz


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-25  4:40       ` Mike Kravetz
@ 2022-03-25 13:33         ` Ray Fucillo
  2022-03-28 18:30           ` Mike Kravetz
  0 siblings, 1 reply; 8+ messages in thread
From: Ray Fucillo @ 2022-03-25 13:33 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Ray Fucillo, linux-kernel, linux-mm


> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> I will continue to look at this.  A quick check of the fork code shows the
> semaphore held in read mode for the duration of the page table copy.

Thank you for looking into it.  

As a side note about fork() for context, and not to distract from the 
regression at hand...  There's some history here where we ran into problems 
circa 2005 where fork time went linear with the size of shared memory, and 
that was resolved by letting the pages fault in the child.  This was when
hugetlb was pretty new (and not used by us) and I see now that the fix
explicitly excluded hugetlb.  Anyway, we now mostly use vfork() and only fork()
in some special cases, and improving just fork wouldn't fix the scalability
regression for us.  But it does sound like fork() time might be getting
large again now that everyone is using very large shared segments with
hugetlb but generally hasn't switched to 1GB pages.  That old thread is:
https://lkml.org/lkml/2005/8/24/190


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-25 13:33         ` Ray Fucillo
@ 2022-03-28 18:30           ` Mike Kravetz
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2022-03-28 18:30 UTC (permalink / raw)
  To: Ray Fucillo
  Cc: linux-kernel, linux-mm, Michal Hocko, Naoya Horiguchi,
	'Aneesh Kumar',
	Kirill A. Shutemov

On 3/25/22 06:33, Ray Fucillo wrote:
> 
>> On Mar 25, 2022, at 12:40 AM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I will continue to look at this.  A quick check of the fork code shows the
>> semaphore held in read mode for the duration of the page table copy.
> 
> Thank you for looking into it.  
> 

Adding some mm people on cc:
Just a quick update on some thoughts and possible approach.

Note that regressions were noted when code was originally added to take
i_mmap_rwsem at fault time.  A limited way of addressing the issue was
proposed here:
https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

I do not think such a change would help in this case as hugetlb pages are used
via a shared memory segment.  Hence, sharing and pmd sharing are happening.

After some thought, I believe the synchronization needed for pmd sharing
as outlined in commit c0d0381ade79 is limited to a single address
space/mm_struct.  We only need to worry about one thread of a process causing
an unshare while another thread in the same process is faulting.  That is
because the unshare only tears down the page tables in the calling process.
Also, the page table modifications associated with pmd sharing are constrained
by the virtual address range of a vma describing the sharable area.
Therefore, pmd sharing synchronization can be done at the vma level.

My 'plan' is to hang a rw_sema off the vm_private_data of hugetlb vmas that
can possibly have shared pmds.  We will use this new semaphore instead of
i_mmap_rwsem at fault and pmd_unshare time.  The only time we should see
contention on this semaphore is if one thread of a process is doing something
to cause unsharing for an address range while another thread is faulting in
the same range.  This seems unlikely, and much much less common than one
process unmapping pages while another process wants to fault them in on a
large shared area.
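
(To make that concrete, something along these lines -- purely illustrative, all names made up:)

/* purely illustrative: a per-vma rwsem hung off vm_private_data */
struct hugetlb_vma_lock {
        struct rw_semaphore rw_sema;
};

static void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
        struct hugetlb_vma_lock *vl = vma->vm_private_data;

        /* only allocated for sharable (VM_MAYSHARE, PUD-size aligned) vmas */
        if (vl)
                down_read(&vl->rw_sema);
}

/* the fault path would take this for read; paths calling huge_pmd_unshare
 * would take it for write, instead of i_mmap_rwsem */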

There will also be a little code shuffling as the fault code is also
synchronized with truncation and hole punch via i_mmap_rwsem.  But, this is
much easier to address.

Comments or other suggestions welcome.
-- 
Mike Kravetz


* Re: scalability regressions related to hugetlb_fault() changes
  2022-03-24 20:12 scalability regressions related to hugetlb_fault() changes Ray Fucillo
  2022-03-24 21:55 ` Randy Dunlap
@ 2022-03-30 12:22 ` Thorsten Leemhuis
  1 sibling, 0 replies; 8+ messages in thread
From: Thorsten Leemhuis @ 2022-03-30 12:22 UTC (permalink / raw)
  To: Ray Fucillo, linux-kernel, regressions

[TLDR: I'm adding the regression report below to regzbot, the Linux
kernel regression tracking bot; all text you find below is compiled from
a few template paragraphs you might have encountered already
from similar mails.]

Hi, this is your Linux kernel regression tracker.

On 24.03.22 21:12, Ray Fucillo wrote:
> In moving to newer versions of the kernel, our customers have experienced dramatic new scalability problems in our database application, InterSystems IRIS.  Our research has narrowed this down to new processes that attach to the database's shared memory segment incurring very long delays (in some cases ~100ms!) while acquiring the i_mmap_lock_read() in hugetlb_fault() as they fault in the huge page for the first time.  The addition of this lock in hugetlb_fault() matches the versions where we see this problem.  The delay doesn't just slow the new process that incurs it; it also backs up other processes if the page fault occurs inside a critical section within the database application.
> 
> Is there something that can be improved here?  
> 
> The read locks in hugetlb_fault() contend with write locks that seem to be taken in very common application code paths: shmat(), process exit, fork() (not vfork()), shmdt(), presumably others.  So hugetlb_fault() contending to read turns out to be common.  When the system is loaded, there will be many new processes faulting in pages that may block the write lock, which in turn blocks more readers faulting behind it, and so on...  I don't think there's any support for shared page tables in hugetlb to avoid the faults altogether.
> 
> Switching to 1GB huge pages instead of 2MB is a good mitigation in that it reduces the frequency of faults, but it's not a complete solution.
> 
> Thanks for considering.
> 
> Ray

Thanks for the report. CCing the regression mailing list, as it should
be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure the issue below doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced c0d0381ade79
#regzbot title mm: scalability regressions related to hugetlb_fault() changes
#regzbot ignore-activity
#regzbot back-burner: looks like this will take some time to get sorted out

If it turns out this isn't a regression, feel free to remove it from the
tracking by sending a reply to this thread containing a paragraph like
"#regzbot invalid: reason why this is invalid" (without the quotes).

Reminder for developers: when fixing the issue, please add 'Link:'
tags pointing to the report (the mail quoted above) using
lore.kernel.org/r/, as explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'. Regzbot needs them to
automatically connect reports with fixes, but they are useful in
general, too.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

-- 
Additional information about regzbot:

If you want to know more about regzbot, check out its web-interface, the
getting start guide, and the references documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents will explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression it's in your interest to
CC the regression list and tell regzbot about the issue, as that ensures
the regression makes it onto the radar of the Linux kernel's regression
tracker and your report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would, just remember to
include a 'Link:' tag in the patch descriptions pointing to all reports
about the issue. This has been expected from developers even before
regzbot showed up for reasons explained in
'Documentation/process/submitting-patches.rst' and
'Documentation/process/5.Posting.rst'.

