RFC for new feature to move pages from one vma to another without split

From: Lokesh Gidra <lokeshgidra@google.com>
To: Peter Xu <peterx@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	Nicolas Geoffray <ngeoffray@google.com>,
	Jared Duke <jdduke@google.com>,
	android-mm <android-mm@google.com>
Subject: RFC for new feature to move pages from one vma to another without split
Date: Thu, 16 Feb 2023 14:27:11 -0800	[thread overview]
Message-ID: <CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com> (raw)

I) SUMMARY:
Requesting comments on a new feature which remaps pages from one
private anonymous mapping to another, without altering the vmas
involved. Two alternatives exist but both have drawbacks:
1. userfaultfd ioctls allocate new pages, copy data and free the old
ones even when updates could be done in-place;
2. mremap results in vma splitting in most of the cases due to 'pgoff' mismatch.

Proposing a new mremap flag or userfaultfd ioctl which enables
remapping pages without these drawbacks. Such a feature, as described
below, would be very helpful in efficient implementation of concurrent
compaction algorithms.

II) MOTIVATION:
Garbage collectors (like the ones used in managed languages) perform
defragmentation of the managed heap by moving objects (of varying
sizes) within the heap. Usually these algorithms have to be concurrent
to avoid response time concerns. These are concurrent in the sense
that while the GC threads are compacting the heap, application threads
continue to make progress, which means enabling access to the heap
while objects are being simultaneously moved.

Given the high overhead of heap compaction, such algorithms typically
segregate the heap into two types of regions (set of contiguous
pages): those that have enough fragmentation to compact, and those
that are densely populated. While only ‘fragmented’ regions are
compacted by sliding objects, both types of regions are traversed to
update references in them to the moved objects.

A) PROT_NONE+SIGSEGV approach:
One of the widely used techniques to ensure data integrity during
concurrent compaction is to use page-level access interception.
Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
before starting compaction and installing a SIGSEGV handler. When GC
threads are compacting the heap, if some application threads fault on
the heap, then they compact the faulted page in the SIGSEGV handler
and then enable access to it before returning. To do this atomically,
the heap must use shmem (MAP_SHARED) so that an alias mapping (with
read-write permission) can be used for moving objects into and
updating references.

Limitation: due to different access rights, the heap can end up with
one vma per page in the worst case, hitting the ‘max_map_count’ limit.

B) Userfaultfd approach:
Userfaultfd avoids the vma split issue by intercepting page-faults
when the page is missing and gives control to user-space to map the
desired content. It doesn’t affect the vma properties. The compaction
algorithm in this case works by first remapping the heap pages (using
mremap) to a secondary mapping and then registering the heap with
userfaultfd for MISSING faults. When an application thread accesses a
page that has not yet been mapped (by other GC/application threads), a
userfault occurs, and as a consequence the corresponding page is
generated and mapped using one of the following two ioctls.
1) COPY ioctl: Typically the heap would be private anonymous in this
case. For every page on the heap, compact the objects into a
page-sized buffer, which COPY ioctl takes as input. The ioctl
allocates a new page, copies the input buffer to it, and then maps it.
This means that even for updating references in the densely populated
regions (where compaction is not done), in-place updation is
impossible. This results in unnecessary page-clear, memcpy and
freeing.
2) CONTINUE ioctl: the two mappings (heap and secondary) are
MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
regions are MISSING, in which case objects are compacted into the
corresponding secondary mapping page (which triggers a regular page
fault to get a page mapped) and then CONTINUE ioctl is invoked, which
maps the same page on the heap mapping. On the other hand, userfaults
in the ‘densely populated’ regions are MINOR (as the page already
exists in the secondary mapping), in which case we update the
references in the already existing page on the secondary mapping and
then invoke CONTINUE ioctl.

Limitation: we observed in our implementation that
page-faults/page-allocation, memcpy, and madvise took (with either of
the two ioctls) ~50% of the time spent in compaction.

III) USE CASE (of the proposed feature):
The proposed feature of moving pages from one vma to another will
enable us to:
A) Recycle pages entirely in the userspace as they are freed (pages
whose objects are already consumed as part of the current compaction
cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
(during page allocation) and memcpy (in the kernel). When the page is
handed over to the kernel for remapping, there is nothing else needed
to be done. Furthermore, since the page is being reused, it doesn’t
have to be freed either.
B) Implement a coarse-grained page-level compaction algorithm wherein
pages containing live objects are slid next to each other without
touching them, while reclaiming in-between pages which contain only
garbage. Such an algorithm is very useful for compacting objects which
are seldom accessed by application and hence are likely to be swapped
out. Without this feature, this would require copying the pages
containing live objects, for which the src pages have to be
swapped-in, only to be soon swapped-out afterwards.

AFAIK, none of the above features can be implemented using mremap
(with current flags), irrespective of whether the heap is a shmem or
private anonymous mapping, because:
1) When moving a page it’s likely that its index will need to change
and mremapping such a page would result in VMA splitting.
2) Using mremap for moving pages would result in the heap’s range
being covered by several vmas. The mremap in the next compaction cycle
(required prior to starting compaction as described above), will fail
with EFAULT. This is because the src range in mremap is not allowed to
span multiple vmas. On the other hand, calling it for each src vma is
not feasible because:
  a) It’s not trivial to identify various vmas covering the heap range
in userspace, and
  b) This operation is supposed to happen with application threads
paused. Invoking numerous mremap syscalls in a pause risks causing
janks.
3) Mremap has scalability concerns due to the need to acquire mmap_sem
exclusively for splitting/merging VMAs. This would impact parallelism
of application threads, particularly during the beginning of the
compaction process when they are expected to cause a spurt of
userfaults.

IV) PROPOSAL:
Initially, maybe the feature can be implemented only for private
anonymous mappings. There are two ways this can be implemented:
A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
entries from the src vma, and move them to dst vma while updating
their ‘mapping’ and ‘index’ fields, if required.

B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
to the MOVE ioctl above.

Assuming (A) is implemented, here is broadly how the compaction would work:
* For a MISSING userfault in the ‘densely populated’ regions, update
pointers in-place in the secondary mapping page corresponding to the
fault address (on the heap) and then use the MOVE ioctl to map it on
the heap. In this case the ‘index’ field would remain the same.
* For a MISSING userfault in ‘fragmented’ regions, pick any freed page
in the secondary map, compact the objects corresponding to the fault
address in this page and then use MOVE ioctl to map it on the fault
address in the heap. This would require updating the ‘index’ field.
After compaction is completed, use madvise(MADV_DONTNEED) on the
secondary mapping to free any remaining pages.

Thanks,
Lokesh