Linux-Fsdevel Archive on lore.kernel.org
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Waiman Long <longman@redhat.com>, Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Davidlohr Bueso <dave@stgolabs.net>
Subject: Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD
Date: Wed, 11 Sep 2019 10:03:16 -0700
Message-ID: <40a511a4-5771-f9a9-40b6-64e39478bbcb@oracle.com>
In-Reply-To: <19d9ea18-bd20-e02f-c1de-70e7322f5f22@redhat.com>

On 9/11/19 8:44 AM, Waiman Long wrote:
> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>> degradation (random multi-second delays) was observed when thousands
>>> of processes are trying to fault in the data into the huge pages. The
>>> likelihood of the delay increases with the number of sockets and hence
>>> the CPUs a system has.  This only happens in the initial setup phase
>>> and will be gone after all the necessary data are faulted in.
>> Can't the application just specify MAP_POPULATE?
> 
> Originally, I thought that this happened in the startup phase when the
> pages were faulted in. The problem persists after steady state had been
> reached though. Every time you have a new user process created, it will
> have its own page table.

This is still at fault time, although for a particular application it may
be after the 'startup phase'.
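
For concreteness, the MAP_POPULATE suggestion quoted above would look roughly
like the sketch below. It is not taken from the patch series: the helper name
map_populated() and its use_hugetlb parameter are assumptions made up for
illustration. Pre-populating only covers the process that makes the call; as
noted in the quoted text, each newly created process still has its own page
table.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): map `len` bytes of shared
 * anonymous memory and ask the kernel to prefault it at mmap() time.
 *
 * If use_hugetlb is nonzero, huge pages are requested; that variant
 * fails unless huge pages have been reserved and `len` is a multiple
 * of the huge page size.  Population itself is best effort: mmap()
 * does not fail just because the mapping could not be populated.
 *
 * Returns the mapping address, or NULL if mmap() failed.
 */
static void *map_populated(size_t len, int use_hugetlb)
{
	int flags = MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE;
	void *p;

	assert(len > 0);
	if (use_hugetlb)
		flags |= MAP_HUGETLB;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

A caller would typically try map_populated(len, 1) first and decide for
itself whether falling back to normal pages is acceptable.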

>                          It is the sharing of huge page shared
> memory that is causing the problem. Of course, it depends on how the
> application is written.

It may be the case that some applications would find the delays acceptable
for the benefit of shared pmds once they reach steady state.  As you say, of
course this depends on how the application is written.

I know that Oracle DB would not like it if PMD sharing is disabled for them.
Based on what I know of their model, all processes which share PMDs perform
faults (write or read) during the startup phase.  This is in environments as
big or bigger than you describe above.  I have never looked at/for delays in
these environments around pmd sharing (page faults), but that does not mean
they do not exist.  I will try to get the DB group to give me access to one
of their large environments for analysis.

We may want to consider making the timeout value and the disable threshold
user-configurable.
-- 
Mike Kravetz

