From: Qian Cai <email@example.com>
To: Waiman Long <firstname.lastname@example.org>
Cc: Mike Kravetz <email@example.com>,
	Matthew Wilcox <firstname.lastname@example.org>,
	Peter Zijlstra <email@example.com>,
	Ingo Molnar <firstname.lastname@example.org>,
	Will Deacon <email@example.com>,
	Alexander Viro <firstname.lastname@example.org>,
	email@example.com, firstname.lastname@example.org,
	email@example.com,
	Davidlohr Bueso <firstname.lastname@example.org>
Subject: Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD
Date: Wed, 11 Sep 2019 13:22:47 -0400
Message-ID: <C29A1EFA-148C-454E-91F1-93D5116FB640@lca.pw>
In-Reply-To: <email@example.com>

> On Sep 11, 2019, at 1:15 PM, Waiman Long <firstname.lastname@example.org> wrote:
>
> On 9/11/19 6:03 PM, Mike Kravetz wrote:
>> On 9/11/19 8:44 AM, Waiman Long wrote:
>>> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>>>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>>>> degradation (random multi-second delays) was observed when thousands
>>>>> of processes are trying to fault in the data into the huge pages. The
>>>>> likelihood of the delay increases with the number of sockets and hence
>>>>> the CPUs a system has. This only happens in the initial setup phase
>>>>> and will be gone after all the necessary data are faulted in.
>>>> Can't the application just specify MAP_POPULATE?
>>> Originally, I thought that this happened in the startup phase when the
>>> pages were faulted in. The problem persists after steady state had been
>>> reached though. Every time you have a new user process created, it will
>>> have its own page table.
>> This is still at fault time. Although, for the particular application it
>> may be after the 'startup phase'.
>>
>>> It is the sharing of the huge page shared
>>> memory that is causing problem. Of course, it depends on how the
>>> application is written.
>> It may be the case that some applications would find the delays acceptable
>> for the benefit of shared pmds once they reach steady state. As you say, of
>> course this depends on how the application is written.
>>
>> I know that Oracle DB would not like it if PMD sharing is disabled for them.
>> Based on what I know of their model, all processes which share PMDs perform
>> faults (write or read) during the startup phase. This is in environments as
>> big or bigger than you describe above. I have never looked at/for delays in
>> these environments around pmd sharing (page faults), but that does not mean
>> they do not exist. I will try to get the DB group to give me access to one
>> of their large environments for analysis.
>>
>> We may want to consider making the timeout value and disable threshold user
>> configurable.
>
> Making it configurable is certainly doable. They can be sysctl
> parameters so that the users can reenable PMD sharing by making those
> parameters larger.

It could be a Kconfig option, so people don’t need to change the setting
every time after reinstalling the system. There are times people don’t
care too much about those random multi-second delays. For example,
running a debug kernel.

Thread overview: 28+ messages
2019-09-11 15:05 [PATCH 0/5] hugetlbfs: Disable PMD sharing for large systems Waiman Long
2019-09-11 15:05 ` [PATCH 1/5] locking/rwsem: Add down_write_timedlock() Waiman Long
2019-09-11 15:05 ` [PATCH 2/5] locking/rwsem: Enable timeout check when spinning on owner Waiman Long
2019-09-11 15:05 ` [PATCH 3/5] locking/osq: Allow early break from OSQ Waiman Long
2019-09-11 15:05 ` [PATCH 4/5] locking/rwsem: Enable timeout check when staying in the OSQ Waiman Long
2019-09-11 15:05 ` [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD Waiman Long
2019-09-11 15:14 ` Matthew Wilcox
2019-09-11 15:44 ` Waiman Long
2019-09-11 17:03 ` Mike Kravetz
2019-09-11 17:15 ` Waiman Long
2019-09-11 17:22 ` Qian Cai [this message]
2019-09-11 17:28 ` Waiman Long
2019-09-11 16:01 ` Qian Cai
2019-09-11 16:34 ` Waiman Long
2019-09-11 19:42 ` Qian Cai
2019-09-11 20:54 ` Waiman Long
2019-09-11 21:57 ` Qian Cai
2019-09-11 19:57 ` Matthew Wilcox
2019-09-11 20:51 ` Waiman Long
2019-09-12 3:26 ` Mike Kravetz
2019-09-12 3:41 ` Matthew Wilcox
2019-09-12 4:40 ` Davidlohr Bueso
2019-09-16 13:53 ` Waiman Long
2019-09-12 9:06 ` Waiman Long
2019-09-12 16:43 ` Mike Kravetz
2019-09-13 18:23 ` Waiman Long
2019-09-13 1:50 ` [PATCH 0/5] hugetlbfs: Disable PMD sharing for large systems Dave Chinner
2019-09-25 8:35 ` Peter Zijlstra