> Given that mremap is holding mmap_sem exclusively, how about userspace
> malloc implementation taking some exclusive malloc lock and doing
> normal mremap followed by mmap with MAP_FIXED to fill the hole ? It
> might end up having largely same overhead. Well, modulo some extra TLB
> flushing. But arguably, reducing TLB flushes for sequence of page
> table updates could be usefully addressed separately (e.g. maybe by
> matching those syscalls, maybe via syslets).

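You can't use MAP_FIXED because it races with other users of mmap: if
another thread maps something into the hole between the mremap and the
mmap, MAP_FIXED will silently replace that mapping.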

The address hint will *usually* work, but you need to deal with the case
where it fails and then cope with the resulting fragmentation.

PaX ASLR ignores address hints so that's something else to consider if
you care about running on PaX/Grsecurity patched kernels.

I'm doing this in my own allocator that's heavily based on the jemalloc
design. It just unmaps the memory given by the hinted mmap call if it
fails to get back the hole:

https://github.com/thestinger/allocator/blob/e80d2d0c2863c490b650ecffeb33beaccfcfdc46/huge.c#L167-L180
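A minimal sketch of that hinted-mmap fallback, not the code at the link
above. map_at_hint is a hypothetical name, and the protection and
mapping flags are assumptions:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map `size` bytes exactly at `hint`
 * without MAP_FIXED. Returns `hint` on success. Returns NULL if mmap
 * failed or if the kernel placed the mapping somewhere else, in which
 * case the stray mapping is unmapped again. */
static void *map_at_hint(void *hint, size_t size)
{
    void *p = mmap(hint, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != hint) {
        /* The hole was taken (or hints are ignored): give it back. */
        munmap(p, size);
        return NULL;
    }
    return p;
}
```

Because MAP_FIXED is never passed, a mapping that another thread placed
in the hole is left untouched. The caller just sees NULL and has to live
with the hole being gone.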

On 64-bit, it relies on 1TiB of reserved address space (works even with
overcommit disabled) to do per-CPU allocation for chunks and huge (>=
chunk size) allocations via address range checks, so it needs this ugly
workaround too:

https://github.com/thestinger/allocator/blob/e80d2d0c2863c490b650ecffeb33beaccfcfdc46/huge.c#L67-L75
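For illustration only, here is a sketch of the reservation plus range
check idea. It is not the workaround at the link above. The struct and
function names are hypothetical, and reserving with a PROT_NONE mapping
is my assumption about one way to do it:

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical bookkeeping for one reserved span of address space. */
struct reserved_range {
    uintptr_t start;
    size_t size;
};

/* Reserve `size` bytes of address space with no access permissions.
 * The pages are not usable until a later mprotect or mmap over them. */
static bool reserve_range(struct reserved_range *r, size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return false;
    r->start = (uintptr_t)p;
    r->size = size;
    return true;
}

/* Range check: true if `ptr` lies inside the reserved span. The
 * unsigned subtraction wraps for addresses below `start`, so a single
 * comparison covers both bounds. */
static bool in_reserved_range(const struct reserved_range *r,
                              const void *ptr)
{
    return (uintptr_t)ptr - r->start < r->size;
}
```

With a check like this, an allocation can be classified from its
address alone, which is why a mapping that lands outside the reserved
span needs special handling.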

I'm convinced that the mmap_sem writer lock can be avoided for the case
with MREMAP_FIXED via a good heuristic though. It just needs to check
that dst is a single VMA that matches the src properties and fall back
to the writer lock if that's not the case. This will have the same
performance as a separate syscall to move pages in all the cases where
that syscall would work.