From: Mike Kravetz <mike.kravetz@oracle.com>
To: Ray Fucillo <Ray.Fucillo@intersystems.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: scalability regressions related to hugetlb_fault() changes
Date: Thu, 24 Mar 2022 15:41:55 -0700
Message-ID: <e4cfa252-7be8-48b2-9f19-019bcc0038af@oracle.com>
In-Reply-To: <43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org>

On 3/24/22 14:55, Randy Dunlap wrote:
> [add linux-mm mailing list]
> 
> On 3/24/22 13:12, Ray Fucillo wrote:
>> In moving to newer versions of the kernel, our customers have experienced dramatic new scalability problems in our database application, InterSystems IRIS.  Our research has narrowed this down to new processes that attach to the database's shared memory segment incurring very long delays (in some cases ~100ms!) acquiring i_mmap_lock_read() in hugetlb_fault() as they fault in the huge page for the first time.  The addition of this lock in hugetlb_fault() matches the kernel versions where we see the problem.  The delay doesn't just slow the new process that incurs it; it also backs up other processes if the page fault occurs inside a critical section within the database application.
>>
>> Is there something that can be improved here?  
>>
>> The read locks in hugetlb_fault() contend with write locks that seem to be taken in very common application code paths: shmat(), process exit, fork() (not vfork()), shmdt(), and presumably others.  So read contention in hugetlb_fault() turns out to be common.  When the system is loaded, many new processes will be faulting in pages, which may block a writer waiting for the lock; that waiting writer in turn blocks more readers faulting behind it, and so on...  I don't think there's any support for shared page tables in hugetlb that would avoid the faults altogether.
>>
>> Switching to 1GB huge pages instead of 2MB is a good mitigation, reducing the frequency of faults, but it is not a complete solution.
>>
>> Thanks for considering.
>>
>> Ray

Hi Ray,

Acquiring i_mmap_rwsem in hugetlb_fault was added in the v5.7 kernel with
commit c0d0381ade79 "hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization".  Ironically, this was added due to correctness (possible
data corruption) issues with huge pmd sharing (shared page tables for hugetlb
at the pmd level).  It is used to synchronize the fault path, which sets up
the sharing, with the unmap (or other) paths that tear down the sharing.
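
To make the interaction concrete, the pattern looks something like the
sketch below.  This is an illustrative userspace program, not taken from
your workload: the segment size, touch stride, and process count are all
made up, and it assumes enough 2MB huge pages are reserved via
vm.nr_hugepages to back the mapping.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE (2UL << 30)	/* 2GB shared segment */
#define HPAGE_SZ (2UL << 20)	/* 2MB huge pages */

int main(void)
{
	/* Shared hugetlb mapping; MAP_HUGETLB here stands in for the
	 * shmget(..., SHM_HUGETLB) + shmat() attach in the database case. */
	char *seg = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (seg == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Many children first-touch pages concurrently; each fault takes
	 * i_mmap_rwsem in read mode in hugetlb_fault().  Each child exit
	 * tears down its mapping, taking the semaphore in write mode.
	 */
	for (int i = 0; i < 64; i++) {
		if (fork() == 0) {
			for (unsigned long off = 0; off < SEG_SIZE; off += HPAGE_SZ)
				seg[off] = 1;
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}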

As mentioned in the commit message, it is 'possible' to approach this issue
in different ways, such as catching races, then cleaning up, backing out,
and retrying.  Adding the synchronization seemed to be the most direct and
least error-prone approach.  I also seem to remember thinking about the
possibility of avoiding the synchronization when pmd sharing is not
possible.  That may be a relatively easy way to speed things up.  I'm not
sure whether pmd sharing comes into play in your customer environments, but
my guess would be yes (shared mapping ranges more than 1GB in size and
aligned to 1GB).
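
For reference, the eligibility condition is roughly the following.  This is
a userspace restatement, assuming x86_64 where PUD_SIZE is 1GB; the helper
name is mine, not the kernel's.

#include <stdio.h>

#define PUD_SIZE (1UL << 30)	/* 1GB on x86_64 */

/* pmd sharing is only considered when the shared mapping covers at
 * least one PUD_SIZE-aligned, PUD_SIZE-sized chunk of address space. */
static int pmd_share_possible(unsigned long start, unsigned long end)
{
	unsigned long base = (start + PUD_SIZE - 1) & ~(PUD_SIZE - 1);

	return base + PUD_SIZE <= end;
}

int main(void)
{
	/* 1GB-aligned, 2GB range: sharing possible */
	printf("%d\n", pmd_share_possible(0x40000000UL, 0xc0000000UL));
	/* unaligned, no full 1GB chunk: sharing not possible */
	printf("%d\n", pmd_share_possible(0x40100000UL, 0x7ff00000UL));
	return 0;
}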

It has been a couple of years since c0d0381ade79; I will take some time to
look into alternatives and/or improvements.

Also, do you have any specifics about the regressions your customers are
seeing?  In particular, which paths are holding i_mmap_rwsem in write mode
for long periods of time?  I would expect something related to unmap.
Truncation can have long hold times, especially if there are many shared
mappings.  Always worth checking the specifics, but more likely this is a
general issue.
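
If it is easy for you to capture, kernel stacks from the rwsem write
slowpath would show exactly who the writers are.  Something like this
bpftrace one-liner is a starting point (assuming the
rwsem_down_write_slowpath symbol, which recent kernels have):

  bpftrace -e 'kprobe:rwsem_down_write_slowpath { @[kstack] = count(); }'

Note that this counts slowpath entries for every rwsem in the system, so
the resulting stacks would need to be filtered down to the i_mmap_rwsem
paths (unmap, truncate, etc.).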
-- 
Mike Kravetz

Thread overview: 8+ messages
2022-03-24 20:12 scalability regressions related to hugetlb_fault() changes Ray Fucillo
2022-03-24 21:55 ` Randy Dunlap
2022-03-24 22:41   ` Mike Kravetz [this message]
2022-03-25  0:02     ` Ray Fucillo
2022-03-25  4:40       ` Mike Kravetz
2022-03-25 13:33         ` Ray Fucillo
2022-03-28 18:30           ` Mike Kravetz
2022-03-30 12:22 ` Thorsten Leemhuis
