From: Daniel Micay <danielmicay@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Aliaksey Kandratsenka <alkondratenko@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Shaohua Li <shli@fb.com>,
	linux-mm@kvack.org, linux-api@vger.kernel.org,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Mel Gorman <mel@csn.ul.ie>, Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>,
	Andy Lutomirski <luto@amacapital.net>,
	"google-perftools@googlegroups.com"
	<google-perftools@googlegroups.com>
Subject: Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
Date: Wed, 25 Mar 2015 16:49:52 -0400
Message-ID: <55131F70.7020503@gmail.com>
In-Reply-To: <5512E0C0.6060406@suse.cz>

On 25/03/15 12:22 PM, Vlastimil Babka wrote:
> 
> I'm not sure I get your description right. The problem I know about is
> where "purging" means madvise(MADV_DONTNEED) and khugepaged later
> collapses a new hugepage that will repopulate the purged parts,
> increasing the memory usage. One can limit this via
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none . That
> setting doesn't affect the page fault THP allocations, which however
> happen only in newly accessed hugepage-sized areas and not partially
> purged ones, though.
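
The knobs mentioned in the quoted text can be inspected from userspace. The following is only a rough sketch, assuming a Linux system that exposes the usual THP sysfs directory; `selected_option` and `read_thp_settings` are hypothetical helper names, not part of any library, and missing files are reported as None rather than treated as errors.

```python
from pathlib import Path

# Standard sysfs location for THP tunables on Linux.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text):
    """Return the bracketed choice from a line like 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_thp_settings(base=THP_DIR):
    """Read the THP mode and the khugepaged max_ptes_none limit, if present."""
    def read(rel):
        try:
            return (Path(base) / rel).read_text().strip()
        except OSError:
            return None  # file absent (no THP support) or unreadable

    enabled = read("enabled")
    max_ptes_none = read("khugepaged/max_ptes_none")
    return {
        "enabled": selected_option(enabled) if enabled else None,
        "max_ptes_none": int(max_ptes_none) if max_ptes_none else None,
    }


if __name__ == "__main__":
    print(read_thp_settings())
```

As the quoted text notes, max_ptes_none only bounds what khugepaged will collapse; it says nothing about huge pages handed out at page fault time.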

Since jemalloc doesn't unmap memory but instead does recycling itself in
userspace, it ends up with large spans of free virtual memory and gets
*lots* of huge pages from the page fault heuristic. It keeps track of
active vs. dirty (not purged) vs. clean (purged / untouched) ranges
everywhere, and will purge dirty ranges as they build up.

The THP allocation on page faults means it ends up with memory that's
supposed to be clean but is really not.

A worst case example with the (up until recently) default chunk size of
4M is allocating a bunch of 2.1M allocations. Chunks are naturally
aligned, so each one can be represented as 2 huge pages. Nearly *50%* of
the resident memory ends up unused. The allocator thinks the tail is clean
memory, but it's not. When the allocations are freed, it will purge the
2.1M at the head (once enough dirty memory builds up) but all of the
tail memory will be leaked until something else is allocated there and
then freed.
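
To make the arithmetic concrete, here is a back-of-the-envelope sketch. It assumes 4 KiB base pages, 2 MiB huge pages, and that a THP fault makes the whole huge page resident on first touch; `round_up` and `thp_footprint` are names made up for illustration. Under those assumptions the unused tail is roughly 47% of what ends up resident, or about 90% on top of what was actually touched.

```python
# Assumed sizes: 4 KiB base pages, 2 MiB huge pages, 4 MiB chunks.
PAGE = 4 * 1024
HUGE_PAGE = 2 * 1024 * 1024
CHUNK = 4 * 1024 * 1024


def round_up(n, align):
    """Round n up to the next multiple of align."""
    return -(-n // align) * align


def thp_footprint(request_bytes):
    """Return (touched, resident) bytes for one chunk-aligned allocation."""
    touched = round_up(request_bytes, PAGE)
    # Every huge-page-sized region containing a touched page becomes resident.
    resident = round_up(touched, HUGE_PAGE)
    return touched, resident


touched, resident = thp_footprint(int(2.1 * 1024 * 1024))
unused = resident - touched
print(f"touched  : {touched / 2**20:.2f} MiB")
print(f"resident : {resident / 2**20:.2f} MiB")
print(f"unused   : {unused / 2**20:.2f} MiB "
      f"({unused / resident:.1%} of resident, {unused / touched:.1%} over touched)")
```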

>> I think a THP implementation that played well with purging would
>> need to drop the page fault heuristic and rely on a significantly better
>> khugepaged.
> 
> See here http://lwn.net/Articles/636162/ (the "Compaction" part)
> 
> The objection is that some short-lived workloads like gcc have to map
> hugepages immediately if they are to benefit from them. I still plan to
> improve khugepaged and allow admins to say that they don't want THP page
> faults (and rely solely on khugepaged which has more information to
> judge additional memory usage), but I'm not sure if it would be an
> acceptable default behavior.
> One workaround in the current state for jemalloc and friends could be to
> use madvise(MADV_NOHUGEPAGE) on hugepage-sized/aligned areas where it
> wants to purge parts of them via madvise(MADV_DONTNEED). It could mean
> overhead of another syscall and tracking of where this was applied and
> when it makes sense to undo this and allow THP to be collapsed again,
> though, and it would also split vma's.

Huge pages do significantly help performance though, and this would
pretty much mean no huge pages. The overhead of toggling it on and off
based on whether it's a < chunk size allocation or a >= chunk size one
is too high.
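
For reference, the suggested workaround looks roughly like this from userspace. This is just a sketch, assuming Python 3.8+ on Linux; `purge_tail` is a hypothetical helper, and it applies MADV_NOHUGEPAGE to the whole mapping because Python's mmap object does not expose the address needed to pick out huge-page-aligned spans. A real allocator would do this in C on aligned ranges and would also have to track where the flag was set.

```python
import mmap

MiB = 1024 * 1024


def purge_tail(region, keep_bytes):
    """Purge everything past keep_bytes; return True if THP was opted out."""
    page = mmap.PAGESIZE
    start = -(-keep_bytes // page) * page  # round up to a page boundary
    length = len(region) - start
    if length <= 0:
        return False
    opted_out = False
    nohuge = getattr(mmap, "MADV_NOHUGEPAGE", None)  # Linux-only constant
    if nohuge is not None:
        try:
            # Extra syscall: keeps page faults and khugepaged from backing
            # the purged tail with a huge page again.
            region.madvise(nohuge)
            opted_out = True
        except OSError:
            pass  # e.g. kernel built without THP support
    # Drop the tail; a private anonymous mapping reads back as zeros.
    region.madvise(mmap.MADV_DONTNEED, start, length)
    return opted_out


if __name__ == "__main__":
    chunk = mmap.mmap(-1, 4 * MiB, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    chunk[:] = b"\xff" * len(chunk)
    print("opted out of THP:", purge_tail(chunk, int(2.1 * MiB)))
    print("head byte:", chunk[0], "tail byte:", chunk[3 * MiB])
    chunk.close()
```

Even in this toy form the cost is visible: one more madvise call per purge, plus the bookkeeping needed to know when to flip the range back.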

The page fault heuristic is just way too aggressive because there's no
indication of how much memory will be used. I don't think it makes sense
to do it without an explicit MADV_HUGEPAGE. Collapsing only dense
ranges doesn't have the same risk.

>> This would mean faulting in a span of memory would no longer
>> be faster. Having a flag to populate a range with madvise would help a
> 
> If it's a newly mapped memory, there's mmap(MAP_POPULATE). There is also
> a madvise(MADV_WILLNEED), which sounds like what you want, but I don't
> know what the implementation does exactly - it was apparently added for
> paging in ahead, and maybe it ignores unpopulated anonymous areas, but
> it would probably be well in spirit of the flag to make it prepopulate
> those.

It doesn't seem to do anything for anon mappings atm but I do see a
patch from 2008 for that. I guess it never landed.

