linux-mm.kvack.org archive mirror
From: Aliaksey Kandratsenka <alkondratenko@gmail.com>
To: Daniel Micay <danielmicay@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Shaohua Li <shli@fb.com>,
	linux-mm@kvack.org, linux-api@vger.kernel.org,
	Rik van Riel <riel@redhat.com>, Hugh Dickins <hughd@google.com>,
	Mel Gorman <mel@csn.ul.ie>, Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>,
	Andy Lutomirski <luto@amacapital.net>,
	"google-perftools@googlegroups.com"
	<google-perftools@googlegroups.com>
Subject: Re: [PATCH] mremap: add MREMAP_NOHOLE flag --resend
Date: Sat, 21 Mar 2015 23:06:14 -0700	[thread overview]
Message-ID: <CADpJO7zBLhjecbiQeTubnTReiicVLr0-K43KbB4uCL5w_dyqJg@mail.gmail.com> (raw)
In-Reply-To: <550A5FF8.90504@gmail.com>


On Wed, Mar 18, 2015 at 10:34 PM, Daniel Micay <danielmicay@gmail.com> wrote:
>
> On 18/03/15 06:31 PM, Andrew Morton wrote:
> > On Tue, 17 Mar 2015 14:09:39 -0700 Shaohua Li <shli@fb.com> wrote:
> >
> >> There was a similar patch posted before, but it doesn't get merged.
> >> I'd like to try again if there are more discussions.
> >> http://marc.info/?l=linux-mm&m=141230769431688&w=2
> >>
> >> mremap can be used to accelerate realloc. The problem is mremap will
> >> punch a hole in original VMA, which makes specific memory allocator
> >> unable to utilize it. Jemalloc is an example. It manages memory in 4M
> >> chunks. mremap a range of the chunk will punch a hole, which other
> >> mmap() syscall can fill into. The 4M chunk is then fragmented, jemalloc
> >> can't handle it.
> >
> > Daniel's changelog had additional details regarding the userspace
> > allocators' behaviour.  It would be best to incorporate that into your
> > changelog.
> >
> > Daniel also had microbenchmark testing results for glibc and jemalloc.
> > Can you please do this?
> >
> > I'm not seeing any testing results for tcmalloc and I'm not seeing
> > confirmation that this patch will be useful for tcmalloc.  Has anyone
> > tried it, or sought input from tcmalloc developers?
>
> TCMalloc and jemalloc are currently equally slow in this benchmark, as
> neither makes use of mremap. They're ~2-3x slower than glibc. I CC'ed
> the currently most active TCMalloc developer so they can give input
> into whether this patch would let them use it.


Hi.

Thanks for looping us in for feedback (I'm CC-ing the gperftools mailing list).

Yes, that might be a useful feature. (Assuming I understood it correctly) I
believe tcmalloc would likely use:

mremap(old_ptr, move_size, move_size,
       MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_NOHOLE,
       new_ptr);

as an optimized equivalent of:

memcpy(new_ptr, old_ptr, move_size);
madvise(old_ptr, move_size, MADV_DONTNEED);

And by the way, I find the MREMAP_RETAIN name from the original patch slightly
more intuitive than MREMAP_NOHOLE. In my humble opinion the latter name does
not reflect the semantics of this feature at all (assuming, of course, that I
correctly understood what the patch does).
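
For concreteness, here is a minimal sketch (my own, not from the patch) of how
an allocator's move path might wrap this, falling back to the memcpy +
MADV_DONTNEED pair above when the flag is unavailable; the helper name is
hypothetical:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

static void move_region(void *new_ptr, void *old_ptr, size_t move_size)
{
#ifdef MREMAP_NOHOLE
        /* Fast path: teleport the pages; old_ptr stays mapped but empty. */
        if (mremap(old_ptr, move_size, move_size,
                   MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_NOHOLE,
                   new_ptr) != MAP_FAILED)
                return;
#endif
        /* Portable path: copy, then drop the now-unused source pages. */
        memcpy(new_ptr, old_ptr, move_size);
        madvise(old_ptr, move_size, MADV_DONTNEED);
}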

I do have a couple of questions about this approach, however. Please feel free
to educate me on them.

a) What is the smallest size at which mremap is going to be faster?

My initial thinking was that we'd likely use mremap in all cases where we know
that touching the destination would cause minor page faults (i.e. when the
destination chunk was MADV_DONTNEED-ed or is a brand new mapping), and then
also always when the size is large enough, because "teleporting" a large
number of pages is likely to be faster than copying them.

But now I realize that it is more interesting than that, because, as Daniel
pointed out, mremap holds mmap_sem exclusively, while page faults hold it for
read. That could be optimized, of course, either by a separate "teleport ptes"
syscall (again, as noted by Daniel), or by having mremap drop mmap_sem for
write and retake it for read for the "moving pages" part of the work. Not
being really familiar with kernel code, I have no idea whether that is doable
or not, but it looks like it might be quite important.

Another aspect where I am similarly illiterate is the performance effect of
the TLB flushes needed for such an operation.

We can certainly experiment and find that limit. But if the mremap threshold
turns out to be large, then perhaps this kernel feature is not as useful as we
may hope.
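
To make that experiment concrete, here is a rough sketch (my own, not part of
the patch) that times a memcpy against a plain mremap move using the existing
MREMAP_MAYMOVE | MREMAP_FIXED flags for growing sizes; the crossover point it
reports should roughly carry over to MREMAP_NOHOLE, modulo the mmap_sem and
TLB effects mentioned above. Error checks for mmap are elided for brevity.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double now(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
        for (size_t size = 1 << 12; size <= (size_t)1 << 28; size <<= 1) {
                char *src = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char *dst = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                memset(src, 1, size);        /* populate the source pages */

                double t0 = now();
                memcpy(dst, src, size);      /* faults in dst as it copies */
                double t1 = now();
                /* Teleport src's pages on top of dst; src becomes a hole. */
                void *moved = mremap(src, size, size,
                                     MREMAP_MAYMOVE | MREMAP_FIXED, dst);
                double t2 = now();

                printf("%10zu bytes: memcpy %.6fs  mremap %.6fs\n",
                       size, t1 - t0, t2 - t1);
                if (moved == MAP_FAILED) {
                        perror("mremap");
                        munmap(src, size);
                        munmap(dst, size);
                } else {
                        munmap(moved, size); /* src's VMA now lives at dst */
                }
        }
        return 0;
}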

b) Is that optimization worth having at all?

After all, memcpy is actually known to be fast. I understand that copying
memory in user space can be slowed down by minor page faults (the results
below seem to confirm that). But this is something where either the allocator
may retain populated pages a bit longer or the kernel could help, e.g. by
exposing something similar to MAP_POPULATE in madvise, or even by doing some
safe combination of madvise and MAP_UNINITIALIZED.
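
For reference, a sketch of what can already be done today along those lines:
MAP_POPULATE pre-faults a brand-new mapping, while for an already
MADV_DONTNEED-ed range the closest existing hint is MADV_WILLNEED, which does
not pre-allocate fresh anonymous pages; that is exactly the gap hinted at
above. Helper names are mine.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* New chunk: ask the kernel to fault all pages in up front, so a later
 * memcpy into it takes no minor faults. */
static void *alloc_chunk_prefaulted(size_t size)
{
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}

/* Previously MADV_DONTNEED-ed chunk: madvise has no anonymous-memory
 * equivalent of MAP_POPULATE (as of this writing), so the best available
 * hint is MADV_WILLNEED, which mostly helps swapped-out or file-backed
 * pages. */
static void prefault_existing_chunk(void *p, size_t size)
{
        madvise(p, size, MADV_WILLNEED);
}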

I've played with Daniel's original benchmark (copied from
http://marc.info/?l=linux-mm&m=141230769431688&w=2) with some tiny
modifications:

#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
        if (argc > 1 && strcmp(argv[1], "--mlock") == 0) {
                int rv = mlockall(MCL_CURRENT | MCL_FUTURE);
                if (rv) {
                        perror("mlockall");
                        abort();
                }
                puts("mlocked!");
        }

        for (size_t i = 0; i < 64; i++) {
                void *ptr = NULL;
                size_t old_size = 0;
                /* Grow the allocation by doubling, touching only the newly
                 * added tail each time, to exercise realloc's move path. */
                for (size_t size = 4; size < (1 << 30); size *= 2) {
                        /*
                         * void *hole = malloc(1 << 20);
                         * if (!hole) {
                         *      perror("malloc");
                         *      abort();
                         * }
                         */
                        ptr = realloc(ptr, size);
                        if (!ptr) {
                                perror("realloc");
                                abort();
                        }
                        /* free(hole); */
                        memset((char *)ptr + old_size, 0xff, size - old_size);
                        old_size = size;
                }
                free(ptr);
        }
        return 0;
}

I cannot say whether this benchmark's vectors of up to 0.5 GiB are common in
important applications or not. It can be argued that apps that care about such
large vectors can do mremap themselves.

On the other hand, I believe this micro-benchmark could plausibly be changed
to grow the vector by a smaller factor (see
https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md#memory-handling).
With a smaller growth factor it seems reasonable to expect larger overhead
from memcpy and smaller overhead from mremap, and thus to favor mremap more.

And I confirm that with all default settings tcmalloc and jemalloc lose to
glibc. Also, notably, a recent dev build of jemalloc (what is going to be 4.0,
AFAIK) actually matches or exceeds glibc's speed, despite still not doing
mremap. Apparently it is smarter about avoiding moving the allocation for
those reallocs, and it was even able to resist my attempt to force it to move
the allocation. I haven't investigated why. Note that I built it a couple of
weeks or so ago from the dev branch, so it might simply have bugs.

Results also vary greatly depending on the transparent huge pages setting.
Here's what I've got (time and sys time are in seconds):

allocator |   mode    | time  | sys time | pgfaults |             extra
----------+-----------+-------+----------+----------+-------------------------------
glibc     |           | 10.75 |     8.44 |  8388770 |
glibc     |    thp    |  5.67 |     3.44 |   310882 |
glibc     |   mlock   | 13.22 |     9.41 |  8388821 |
glibc     | thp+mlock |  8.43 |     4.63 |   310933 |
tcmalloc  |           | 11.46 |     2.00 |  2104826 | TCMALLOC_AGGRESSIVE_DECOMMIT=f
tcmalloc  |    thp    | 10.61 |     0.89 |   386206 | TCMALLOC_AGGRESSIVE_DECOMMIT=f
tcmalloc  |   mlock   | 10.11 |     0.27 |   264721 | TCMALLOC_AGGRESSIVE_DECOMMIT=f
tcmalloc  | thp+mlock | 10.28 |     0.17 |    46011 | TCMALLOC_AGGRESSIVE_DECOMMIT=f
tcmalloc  |           | 23.63 |    17.16 | 16770107 | TCMALLOC_AGGRESSIVE_DECOMMIT=t
tcmalloc  |    thp    | 11.82 |     5.14 |   352477 | TCMALLOC_AGGRESSIVE_DECOMMIT=t
tcmalloc  |   mlock   | 10.10 |     0.28 |   264724 | TCMALLOC_AGGRESSIVE_DECOMMIT=t
tcmalloc  | thp+mlock | 10.30 |     0.17 |    49168 | TCMALLOC_AGGRESSIVE_DECOMMIT=t
jemalloc1 |           | 23.71 |    17.33 | 16744572 |
jemalloc1 |    thp    | 11.65 |     4.68 |    64988 |
jemalloc1 |   mlock   | 10.13 |     0.29 |   263305 |
jemalloc1 | thp+mlock | 10.05 |     0.17 |    50217 |
jemalloc2 |           | 10.87 |     8.64 |  8521796 |
jemalloc2 |    thp    |  4.64 |     2.32 |    56060 |
jemalloc2 |   mlock   |  4.22 |     0.28 |   263181 |
jemalloc2 | thp+mlock |  4.12 |     0.19 |    50411 |
----------+-----------+-------+----------+----------+-------------------------------

NOTE: the usual disclaimer applies about the possibility of my having screwed
something up and gotten invalid benchmark results without being able to see
it. I apologize in advance.

NOTE: jemalloc1 is 3.6 as shipped by up-to-date Debian Sid. jemalloc2 is a
home-built snapshot of the upcoming jemalloc 4.0.

NOTE: TCMALLOC_AGGRESSIVE_DECOMMIT=t (the default since 2.4) makes tcmalloc
MADV_DONTNEED large free blocks immediately, as opposed to much less eagerly
with a setting of "false". It makes a big difference in page fault counts and
thus in runtime.

Another notable thing is how mlock effectively disables MADV_DONTNEED for
jemalloc{1,2} and tcmalloc, lowering the page fault count and thus improving
runtime. It can be seen that tcmalloc+mlock on the THP-less configuration is
slightly better in runtime than glibc. The latter spends a ton of time in the
kernel, probably handling minor page faults, while the former burns CPU in
user space doing memcpys. So "tons of memcpys" seems to be competitive with
what glibc is doing in this benchmark.

THP changes things, however: apparently minor page faults become a lot
cheaper, which makes the glibc case a lot faster than even the tcmalloc+mlock
case. So with THP, the cost of page faults is smaller than the cost of the
large memcpys.

So the results are somewhat mixed, and overall I'm not sure I can see a very
convincing story for MREMAP_NOHOLE yet. However:

1) It is possible that I am missing something. If so, please educate me.

2) If the kernel implements this API, I'm going to use it in tcmalloc.

P.S. The benchmark results also seem to indicate that tcmalloc could do
something to explicitly enable THP and perhaps adapt better to its presence,
possibly with some collaboration with the kernel, e.g. to prevent the famous
stalls that cause people to disable THP.
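
For what it's worth, a minimal sketch of the per-region opt-in that already
exists for this, assuming tcmalloc applied it when carving out its spans (the
function name is mine):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Opt a region into transparent huge pages even when the system-wide THP
 * setting is "madvise". */
static void enable_thp_for_region(void *p, size_t size)
{
        madvise(p, size, MADV_HUGEPAGE);
}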



Thread overview: 27+ messages
2015-03-17 21:09 [PATCH] mremap: add MREMAP_NOHOLE flag --resend Shaohua Li
2015-03-18 22:31 ` Andrew Morton
2015-03-19  5:08   ` Shaohua Li
2015-03-19  5:22     ` Andrew Morton
2015-03-19 16:38       ` Shaohua Li
2015-03-19  5:34   ` Daniel Micay
2015-03-22  6:06     ` Aliaksey Kandratsenka [this message]
2015-03-22  7:22       ` Daniel Micay
2015-03-24  4:36         ` Aliaksey Kandratsenka
2015-03-24 14:54           ` Daniel Micay
2015-03-25 16:22         ` Vlastimil Babka
2015-03-25 20:49           ` Daniel Micay
2015-03-25 20:54             ` Daniel Micay
2015-03-26  0:19             ` David Rientjes
2015-03-26  0:24               ` Daniel Micay
2015-03-26  2:31                 ` David Rientjes
2015-03-26  3:24                   ` Daniel Micay
2015-03-26  3:36                     ` Daniel Micay
2015-03-26 17:25                     ` Vlastimil Babka
2015-03-26 20:45                       ` Daniel Micay
2015-03-23  5:17       ` Shaohua Li
2015-03-24  5:25         ` Aliaksey Kandratsenka
2015-03-24 14:39           ` Daniel Micay
2015-03-25  5:02             ` Shaohua Li
2015-03-26  0:50             ` Minchan Kim
2015-03-26  1:21               ` Daniel Micay
2015-03-26  7:02                 ` Minchan Kim
