linux-mm.kvack.org archive mirror
* RFC for new feature to move pages from one vma to another without split
@ 2023-02-16 22:27 Lokesh Gidra
  2023-04-06 17:29 ` Peter Xu
From: Lokesh Gidra @ 2023-02-16 22:27 UTC (permalink / raw)
  To: Peter Xu, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov
  Cc: Brian Geffon, Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray,
	Jared Duke, android-mm

I) SUMMARY:
Requesting comments on a new feature which remaps pages from one
private anonymous mapping to another, without altering the vmas
involved. Two alternatives exist but both have drawbacks:
1. userfaultfd ioctls allocate new pages, copy data and free the old
ones even when updates could be done in-place;
2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.

Proposing a new mremap flag or userfaultfd ioctl which enables
remapping pages without these drawbacks. Such a feature, as described
below, would be very helpful in efficient implementation of concurrent
compaction algorithms.


II) MOTIVATION:
Garbage collectors (like the ones used in managed languages) perform
defragmentation of the managed heap by moving objects (of varying
sizes) within the heap. Usually these algorithms have to be concurrent
to avoid response time concerns. These are concurrent in the sense
that while the GC threads are compacting the heap, application threads
continue to make progress, which means enabling access to the heap
while objects are being simultaneously moved.

Given the high overhead of heap compaction, such algorithms typically
segregate the heap into two types of regions (set of contiguous
pages): those that have enough fragmentation to compact, and those
that are densely populated. While only ‘fragmented’ regions are
compacted by sliding objects, both types of regions are traversed to
update references in them to the moved objects.

A) PROT_NONE+SIGSEGV approach:
One of the widely used techniques to ensure data integrity during
concurrent compaction is to use page-level access interception.
Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
before starting compaction and installing a SIGSEGV handler. When GC
threads are compacting the heap, if some application threads fault on
the heap, then they compact the faulted page in the SIGSEGV handler
and then enable access to it before returning. To do this atomically,
the heap must use shmem (MAP_SHARED) so that an alias mapping (with
read-write permission) can be used for moving objects into and
updating references.

Limitation: due to different access rights, the heap can end up with
one vma per page in the worst case, hitting the ‘max_map_count’ limit.
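
For concreteness, here is a minimal sketch of such a handler (helper
names like compact_page_into() are hypothetical; error handling and
async-signal-safety details are omitted). The per-page mprotect() at
the end is what fragments the heap into one vma per page:

	#include <signal.h>
	#include <stdint.h>
	#include <sys/mman.h>

	/* Two MAP_SHARED views of the same shmem file. */
	static char *heap, *alias;
	static uintptr_t page_size;

	extern void compact_page_into(char *dst); /* hypothetical GC routine */

	static void segv_handler(int sig, siginfo_t *info, void *ucontext)
	{
		uintptr_t addr = (uintptr_t)info->si_addr & ~(page_size - 1);

		/* Compact/update the page through the writable alias view. */
		compact_page_into(alias + (addr - (uintptr_t)heap));

		/* Re-enable access to just this page; each such call can
		 * split the PROT_NONE heap vma. */
		mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE);
	}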

B) Userfaultfd approach:
Userfaultfd avoids the vma split issue by intercepting page-faults
when the page is missing and giving control to user-space to map the
desired content. It doesn’t affect the vma properties. The compaction
algorithm in this case works by first remapping the heap pages (using
mremap) to a secondary mapping and then registering the heap with
userfaultfd for MISSING faults. When an application thread accesses a
page that has not yet been mapped (by other GC/application threads), a
userfault occurs, and as a consequence the corresponding page is
generated and mapped using one of the following two ioctls.
1) COPY ioctl: Typically the heap would be private anonymous in this
case. For every page on the heap, compact the objects into a
page-sized buffer, which COPY ioctl takes as input. The ioctl
allocates a new page, copies the input buffer to it, and then maps it.
This means that even for updating references in the densely populated
regions (where compaction is not done), in-place updates are
impossible. This results in unnecessary page clearing, memcpy, and
freeing.
2) CONTINUE ioctl: the two mappings (heap and secondary) are
MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
regions are MISSING, in which case objects are compacted into the
corresponding secondary mapping page (which triggers a regular page
fault to get a page mapped) and then CONTINUE ioctl is invoked, which
maps the same page on the heap mapping. On the other hand, userfaults
in the ‘densely populated’ regions are MINOR (as the page already
exists in the secondary mapping), in which case we update the
references in the already existing page on the secondary mapping and
then invoke CONTINUE ioctl.
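
For reference, resolving a single fault with each of the two ioctls
looks roughly like this (a sketch: uffd registration, the fault
address, and the prepared buffer/page are assumed to be set up
elsewhere; error handling omitted):

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	/* COPY: the kernel allocates a fresh page and copies the
	 * user-prepared, page-sized buffer into it. */
	static void resolve_with_copy(int uffd, unsigned long fault_addr,
				      void *buf, unsigned long page_size)
	{
		struct uffdio_copy copy = {
			.dst = fault_addr,
			.src = (unsigned long)buf,
			.len = page_size,
		};
		ioctl(uffd, UFFDIO_COPY, &copy);
	}

	/* CONTINUE: map the page that already exists in the shmem page
	 * cache (the one prepared via the secondary mapping). */
	static void resolve_with_continue(int uffd, unsigned long fault_addr,
					  unsigned long page_size)
	{
		struct uffdio_continue cont = {
			.range = { .start = fault_addr, .len = page_size },
		};
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}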

Limitation: we observed in our implementation that
page-faults/page-allocation, memcpy, and madvise took (with either of
the two ioctls) ~50% of the time spent in compaction.


III) USE CASE (of the proposed feature):
The proposed feature of moving pages from one vma to another will
enable us to:
A) Recycle pages entirely in the userspace as they are freed (pages
whose objects are already consumed as part of the current compaction
cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
(during page allocation) and memcpy (in the kernel). When the page is
handed over to the kernel for remapping, there is nothing else needed
to be done. Furthermore, since the page is being reused, it doesn’t
have to be freed either.
B) Implement a coarse-grained page-level compaction algorithm wherein
pages containing live objects are slid next to each other without
touching them, while reclaiming in-between pages which contain only
garbage. Such an algorithm is very useful for compacting objects which
are seldom accessed by application and hence are likely to be swapped
out. Without this feature, this would require copying the pages
containing live objects, for which the src pages have to be
swapped-in, only to be soon swapped-out afterwards.

AFAIK, none of the above features can be implemented using mremap
(with current flags), irrespective of whether the heap is a shmem or
private anonymous mapping, because:
1) When moving a page it’s likely that its index will need to change
and mremapping such a page would result in VMA splitting.
2) Using mremap for moving pages would result in the heap’s range
being covered by several vmas. The mremap in the next compaction cycle
(required prior to starting compaction as described above), will fail
with EFAULT. This is because the src range in mremap is not allowed to
span multiple vmas. On the other hand, calling it for each src vma is
not feasible because:
  a) It’s not trivial to identify various vmas covering the heap range
in userspace, and
  b) This operation is supposed to happen with application threads
paused. Invoking numerous mremap syscalls in a pause risks causing
janks.
3) Mremap has scalability concerns due to the need to acquire mmap_sem
exclusively for splitting/merging VMAs. This would impact parallelism
of application threads, particularly during the beginning of the
compaction process when they are expected to cause a spurt of
userfaults.


IV) PROPOSAL:
Initially, maybe the feature can be implemented only for private
anonymous mappings. There are two ways this can be implemented:
A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
entries from the src vma, and move them to dst vma while updating
their ‘mapping’ and ‘index’ fields, if required.

B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
to the MOVE ioctl above.

Assuming (A) is implemented, here is broadly how the compaction would work:
* For a MISSING userfault in the ‘densely populated’ regions, update
pointers in-place in the secondary mapping page corresponding to the
fault address (on the heap) and then use the MOVE ioctl to map it on
the heap. In this case the ‘index’ field would remain the same.
* For a MISSING userfault in ‘fragmented’ regions, pick any freed page
in the secondary map, compact the objects corresponding to the fault
address in this page and then use MOVE ioctl to map it on the fault
address in the heap. This would require updating the ‘index’ field.
After compaction is completed, use madvise(MADV_DONTNEED) on the
secondary mapping to free any remaining pages.
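
To make the proposed interface concrete, a hypothetical 'MOVE' could
mirror the COPY ioctl's argument layout (nothing below exists in the
kernel today; the struct and ioctl names are purely illustrative):

	#include <linux/types.h>

	/* Hypothetical, modeled on struct uffdio_copy. */
	struct uffdio_move {
		__u64 dst;	/* faulting address on the heap */
		__u64 src;	/* prepared page in the secondary mapping */
		__u64 len;
		__u64 mode;
		__s64 move;	/* bytes moved, or negated error */
	};

	/* Usage in the fault handler (hypothetical):
	 *   struct uffdio_move mv = { .dst = fault_addr,
	 *                             .src = src_page, .len = page_size };
	 *   ioctl(uffd, UFFDIO_MOVE, &mv);
	 * The ptes backing [src, src+len) would be detached from the
	 * secondary vma and moved to the heap vma. */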


Thanks,
Lokesh



* Re: RFC for new feature to move pages from one vma to another without split
  2023-02-16 22:27 RFC for new feature to move pages from one vma to another without split Lokesh Gidra
@ 2023-04-06 17:29 ` Peter Xu
  2023-04-10  7:41   ` Lokesh Gidra
  2023-04-12  8:47   ` David Hildenbrand
From: Peter Xu @ 2023-04-06 17:29 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Axel Rasmussen, Andrew Morton, open list:MEMORY MANAGEMENT,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Kirill A. Shutemov, Brian Geffon, Suren Baghdasaryan,
	Kalesh Singh, Nicolas Geoffray, Jared Duke, android-mm,
	Blake Caldwell, Mike Rapoport

Hi, Lokesh,

Sorry for a late reply.  Copy Blake Caldwell and Mike too.

On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> I) SUMMARY:
> Requesting comments on a new feature which remaps pages from one
> private anonymous mapping to another, without altering the vmas
> involved. Two alternatives exist but both have drawbacks:
> 1. userfaultfd ioctls allocate new pages, copy data and free the old
> ones even when updates could be done in-place;
> 2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.

Personally it was always a mystery to me how vm_pgoff works with
anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.

Just now I tried to apply below oneliner change:

@@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
                        /*
                         * Set pgoff according to addr for anon_vma.
                         */
-                       pgoff = addr >> PAGE_SHIFT;
+                       pgoff = 0;
                        break;
                default:
                        return -EINVAL;

The kernel even boots without a major problem so far..

I have a feeling that I'm missing something else here; it'll be great if
anyone knows.

Anyway, I agree mremap() is definitely not the best way to do page level
operations like this, no matter whether vm_pgoff can match or not.

> 
> Proposing a new mremap flag or userfaultfd ioctl which enables
> remapping pages without these drawbacks. Such a feature, as described
> below, would be very helpful in efficient implementation of concurrent
> compaction algorithms.

After I read the proposal, I had a feeling that you're not aware that we
have similar proposals adding UFFDIO_REMAP.

I think it started with Andrea's initial proposal on the whole uffd:

https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/

Then for some reason it wasn't merged in the initial version, but it has
at least been proposed again here (even though the goal seems slightly
different; that one wants to move pages out instead of moving them in):

https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/

Also worth checking the latest commit that Andrea maintains himself (I
doubt there are major changes, but still, just to make it complete):

https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92

So far I think that's what you're looking for.  I'm not sure whether the
limitations mentioned in the old proposals of UFFDIO_REMAP will be a
problem, though.  For example, it required not only anonymous memory but
also mapcount==1 on all src pages.  But maybe that's not a problem here
either.

> 
> II) MOTIVATION:
> Garbage collectors (like the ones used in managed languages) perform
> defragmentation of the managed heap by moving objects (of varying
> sizes) within the heap. Usually these algorithms have to be concurrent
> to avoid response time concerns. These are concurrent in the sense
> that while the GC threads are compacting the heap, application threads
> continue to make progress, which means enabling access to the heap
> while objects are being simultaneously moved.
> 
> Given the high overhead of heap compaction, such algorithms typically
> segregate the heap into two types of regions (set of contiguous
> pages): those that have enough fragmentation to compact, and those
> that are densely populated. While only ‘fragmented’ regions are
> compacted by sliding objects, both types of regions are traversed to
> update references in them to the moved objects.
> 
> A) PROT_NONE+SIGSEGV approach:
> One of the widely used techniques to ensure data integrity during
> concurrent compaction is to use page-level access interception.
> Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
> before starting compaction and installing a SIGSEGV handler. When GC
> threads are compacting the heap, if some application threads fault on
> the heap, then they compact the faulted page in the SIGSEGV handler
> and then enable access to it before returning. To do this atomically,
> the heap must use shmem (MAP_SHARED) so that an alias mapping (with
> read-write permission) can be used for moving objects into and
> updating references.
> 
> Limitation: due to different access rights, the heap can end up with
> one vma per page in the worst case, hitting the ‘max_map_count’ limit.
> 
> B) Userfaultfd approach:
> Userfaultfd avoids the vma split issue by intercepting page-faults
> when the page is missing and giving control to user-space to map the
> desired content. It doesn’t affect the vma properties. The compaction
> algorithm in this case works by first remapping the heap pages (using
> mremap) to a secondary mapping and then registering the heap with
> userfaultfd for MISSING faults. When an application thread accesses a
> page that has not yet been mapped (by other GC/application threads), a
> userfault occurs, and as a consequence the corresponding page is
> generated and mapped using one of the following two ioctls.
> 1) COPY ioctl: Typically the heap would be private anonymous in this
> case. For every page on the heap, compact the objects into a
> page-sized buffer, which COPY ioctl takes as input. The ioctl
> allocates a new page, copies the input buffer to it, and then maps it.
> This means that even for updating references in the densely populated
> regions (where compaction is not done), in-place updates are
> impossible. This results in unnecessary page clearing, memcpy, and
> freeing.
> 2) CONTINUE ioctl: the two mappings (heap and secondary) are
> MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
> regions are MISSING, in which case objects are compacted into the
> corresponding secondary mapping page (which triggers a regular page
> fault to get a page mapped) and then CONTINUE ioctl is invoked, which
> maps the same page on the heap mapping. On the other hand, userfaults
> in the ‘densely populated’ regions are MINOR (as the page already
> exists in the secondary mapping), in which case we update the
> references in the already existing page on the secondary mapping and
> then invoke CONTINUE ioctl.
> 
> Limitation: we observed in our implementation that
> page-faults/page-allocation, memcpy, and madvise took (with either of
> the two ioctls) ~50% of the time spent in compaction.

I assume "page-faults" applies to CONTINUE, while "page-allocation" applies
to COPY here.  UFFDIO_REMAP can definitely avoid memcpy, but I don't know
how much it'll remove in total, e.g., I don't think page faults can be
avoided anyway?  Also, madvise(), depending on what it is.  If it's only
MADV_DONTNEED, maybe it'll be helpful too so the library can reuse wasted
pages directly hence reducing DONTNEEDs.

> III) USE CASE (of the proposed feature):
> The proposed feature of moving pages from one vma to another will
> enable us to:
> A) Recycle pages entirely in the userspace as they are freed (pages
> whose objects are already consumed as part of the current compaction
> cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
> (during page allocation) and memcpy (in the kernel). When the page is
> handed over to the kernel for remapping, there is nothing else needed
> to be done. Furthermore, since the page is being reused, it doesn’t
> have to be freed either.
> B) Implement a coarse-grained page-level compaction algorithm wherein
> pages containing live objects are slid next to each other without
> touching them, while reclaiming in-between pages which contain only
> garbage. Such an algorithm is very useful for compacting objects which
> are seldom accessed by application and hence are likely to be swapped
> out. Without this feature, this would require copying the pages
> containing live objects, for which the src pages have to be
> swapped-in, only to be soon swapped-out afterwards.
> 
> AFAIK, none of the above features can be implemented using mremap
> (with current flags), irrespective of whether the heap is a shmem or
> private anonymous mapping, because:
> 1) When moving a page it’s likely that its index will need to change
> and mremapping such a page would result in VMA splitting.
> 2) Using mremap for moving pages would result in the heap’s range
> being covered by several vmas. The mremap in the next compaction cycle
> (required prior to starting compaction as described above), will fail
> with EFAULT. This is because the src range in mremap is not allowed to
> span multiple vmas. On the other hand, calling it for each src vma is
> not feasible because:
>   a) It’s not trivial to identify various vmas covering the heap range
> in userspace, and
>   b) This operation is supposed to happen with application threads
> paused. Invoking numerous mremap syscalls in a pause risks causing
> janks.
> 3) Mremap has scalability concerns due to the need to acquire mmap_sem
> exclusively for splitting/merging VMAs. This would impact parallelism
> of application threads, particularly during the beginning of the
> compaction process when they are expected to cause a spurt of
> userfaults.
> 
> 
> IV) PROPOSAL:
> Initially, maybe the feature can be implemented only for private
> anonymous mappings. There are two ways this can be implemented:
> A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
> ‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
> entries from the src vma, and move them to dst vma while updating
> their ‘mapping’ and ‘index’ fields, if required.
> 
> B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
> to the MOVE ioctl above.
> 
> Assuming (A) is implemented, here is broadly how the compaction would work:
> * For a MISSING userfault in the ‘densely populated’ regions, update
> pointers in-place in the secondary mapping page corresponding to the
> fault address (on the heap) and then use the MOVE ioctl to map it on
> the heap. In this case the ‘index’ field would remain the same.
> * For a MISSING userfault in ‘fragmented’ regions, pick any freed page
> in the secondary map, compact the objects corresponding to the fault
> address in this page and then use MOVE ioctl to map it on the fault
> address in the heap. This would require updating the ‘index’ field.
> After compaction is completed, use madvise(MADV_DONTNEED) on the
> secondary mapping to free any remaining pages.
> 
> 
> Thanks,
> Lokesh
> 

-- 
Peter Xu




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-06 17:29 ` Peter Xu
@ 2023-04-10  7:41   ` Lokesh Gidra
  2023-04-11 15:14     ` Peter Xu
  2023-04-12  8:47   ` David Hildenbrand
From: Lokesh Gidra @ 2023-04-10  7:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Axel Rasmussen, Andrew Morton, open list:MEMORY MANAGEMENT,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Kirill A. Shutemov, Brian Geffon, Suren Baghdasaryan,
	Kalesh Singh, Nicolas Geoffray, Jared Duke, android-mm,
	Blake Caldwell, Mike Rapoport

On Thu, Apr 6, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Lokesh,
>
> Sorry for a late reply.  Copy Blake Caldwell and Mike too.

Thanks for the reply. It's extremely helpful.
>
> On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> > I) SUMMARY:
> > Requesting comments on a new feature which remaps pages from one
> > private anonymous mapping to another, without altering the vmas
> > involved. Two alternatives exist but both have drawbacks:
> > 1. userfaultfd ioctls allocate new pages, copy data and free the old
> > ones even when updates could be done in-place;
> > 2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.
>
> Personally it was always a mystery to me how vm_pgoff works with
> anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
>
> Just now I tried to apply below oneliner change:
>
> @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>                         /*
>                          * Set pgoff according to addr for anon_vma.
>                          */
> -                       pgoff = addr >> PAGE_SHIFT;
> +                       pgoff = 0;
>                         break;
>                 default:
>                         return -EINVAL;
>
> The kernel even boots without a major problem so far..
>
> I have a feeling that I'm missing something else here; it'll be great if
> anyone knows.
>
> Anyway, I agree mremap() is definitely not the best way to do page level
> operations like this, no matter whether vm_pgoff can match or not.
>
> >
> > Proposing a new mremap flag or userfaultfd ioctl which enables
> > remapping pages without these drawbacks. Such a feature, as described
> > below, would be very helpful in efficient implementation of concurrent
> > compaction algorithms.
>
> After I read the proposal, I had a feeling that you're not aware that we
> have similar proposals adding UFFDIO_REMAP.

Yes, I wasn't aware of this. Thanks a lot for sharing the details.
>
> I think it started with Andrea's initial proposal on the whole uffd:
>
> https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/
>
> Then for some reason it wasn't merged in the initial version, but it has
> at least been proposed again here (even though the goal seems slightly
> different; that one wants to move pages out instead of moving them in):
>
> https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/

Yeah, this seems to be the opposite of what I'm looking for. IIUC, a
page-out REMAP can't satisfy any MISSING userfault. In fact, it enables
MISSING faults in the future. Maybe a flag can be added to the
uffdio_remap struct to accommodate this case, if it is still being
pursued.
>
> Also worth checking the latest commit that Andrea maintains himself (I
> doubt there are major changes, but still, just to make it complete):
>
> https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92
>
> So far I think that's what you're looking for.  I'm not sure whether the
> limitations mentioned in the old proposals of UFFDIO_REMAP will be a
> problem, though.  For example, it required not only anonymous memory but
> also mapcount==1 on all src pages.  But maybe that's not a problem here
> either.

Yes, this is exactly what I am looking for. The mapcount==1 is not a
problem either. Any idea why the patch isn't merged?

>
> >
> > II) MOTIVATION:
> > Garbage collectors (like the ones used in managed languages) perform
> > defragmentation of the managed heap by moving objects (of varying
> > sizes) within the heap. Usually these algorithms have to be concurrent
> > to avoid response time concerns. These are concurrent in the sense
> > that while the GC threads are compacting the heap, application threads
> > continue to make progress, which means enabling access to the heap
> > while objects are being simultaneously moved.
> >
> > Given the high overhead of heap compaction, such algorithms typically
> > segregate the heap into two types of regions (set of contiguous
> > pages): those that have enough fragmentation to compact, and those
> > that are densely populated. While only ‘fragmented’ regions are
> > compacted by sliding objects, both types of regions are traversed to
> > update references in them to the moved objects.
> >
> > A) PROT_NONE+SIGSEGV approach:
> > One of the widely used techniques to ensure data integrity during
> > concurrent compaction is to use page-level access interception.
> > Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
> > before starting compaction and installing a SIGSEGV handler. When GC
> > threads are compacting the heap, if some application threads fault on
> > the heap, then they compact the faulted page in the SIGSEGV handler
> > and then enable access to it before returning. To do this atomically,
> > the heap must use shmem (MAP_SHARED) so that an alias mapping (with
> > read-write permission) can be used for moving objects into and
> > updating references.
> >
> > Limitation: due to different access rights, the heap can end up with
> > one vma per page in the worst case, hitting the ‘max_map_count’ limit.
> >
> > B) Userfaultfd approach:
> > Userfaultfd avoids the vma split issue by intercepting page-faults
> > when the page is missing and giving control to user-space to map the
> > desired content. It doesn’t affect the vma properties. The compaction
> > algorithm in this case works by first remapping the heap pages (using
> > mremap) to a secondary mapping and then registering the heap with
> > userfaultfd for MISSING faults. When an application thread accesses a
> > page that has not yet been mapped (by other GC/application threads), a
> > userfault occurs, and as a consequence the corresponding page is
> > generated and mapped using one of the following two ioctls.
> > 1) COPY ioctl: Typically the heap would be private anonymous in this
> > case. For every page on the heap, compact the objects into a
> > page-sized buffer, which COPY ioctl takes as input. The ioctl
> > allocates a new page, copies the input buffer to it, and then maps it.
> > This means that even for updating references in the densely populated
> > regions (where compaction is not done), in-place updates are
> > impossible. This results in unnecessary page clearing, memcpy, and
> > freeing.
> > 2) CONTINUE ioctl: the two mappings (heap and secondary) are
> > MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
> > regions are MISSING, in which case objects are compacted into the
> > corresponding secondary mapping page (which triggers a regular page
> > fault to get a page mapped) and then CONTINUE ioctl is invoked, which
> > maps the same page on the heap mapping. On the other hand, userfaults
> > in the ‘densely populated’ regions are MINOR (as the page already
> > exists in the secondary mapping), in which case we update the
> > references in the already existing page on the secondary mapping and
> > then invoke CONTINUE ioctl.
> >
> > Limitation: we observed in our implementation that
> > page-faults/page-allocation, memcpy, and madvise took (with either of
> > the two ioctls) ~50% of the time spent in compaction.
>
> I assume "page-faults" applies to CONTINUE, while "page-allocation" applies
> to COPY here.  UFFDIO_REMAP can definitely avoid memcpy, but I don't know
> how much it'll remove in total, e.g., I don't think page faults can be
> avoided anyway?  Also, madvise(), depending on what it is.  If it's only
> MADV_DONTNEED, maybe it'll be helpful too so the library can reuse wasted
> pages directly hence reducing DONTNEEDs.
>
That's right. page-faults -> CONTINUE and page-allocation -> COPY. The
GC algorithm I'm describing here is mostly page-fault free as the heap
pages are recycled.

Basically, the heap is mremapped to a secondary mapping so that we can
start receiving MISSING faults on the heap after userfaultfd
registration. Consequently, on every MISSING userfault, the pages from
the secondary mapping are prepared in-place before acting as 'src' for
the UFFDIO_REMAP ioctl call.

Also, as you said, MADV_DONTNEED will be mostly eliminated as most of
the pages are recycled in userspace.

There are other things too that UFFDIO_REMAP enables us to do. It
allows coarse-grained page-by-page compaction of the heap without
swapping in the pages. This isn't possible today.
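
For illustration, the setup step could look roughly like this (a
sketch of one way to do it; our implementation's exact flags aren't
spelled out above, so using MREMAP_DONTUNMAP here is an assumption,
'second' is assumed to be a pre-reserved destination region of the
same size, and error handling is omitted). MREMAP_DONTUNMAP keeps the
heap vma in place but empty, so subsequent accesses raise MISSING
faults:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int setup_heap_uffd(char *heap, char *second, size_t heap_size)
	{
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)heap,
				   .len = heap_size },
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		ioctl(uffd, UFFDIO_API, &api);

		/* Move the heap's pages aside while keeping the (now
		 * empty) heap vma in place. */
		mremap(heap, heap_size, heap_size,
		       MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP,
		       second);

		/* Arm MISSING faults on the now-empty heap range. */
		ioctl(uffd, UFFDIO_REGISTER, &reg);
		return uffd;
	}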

> > III) USE CASE (of the proposed feature):
> > The proposed feature of moving pages from one vma to another will
> > enable us to:
> > A) Recycle pages entirely in the userspace as they are freed (pages
> > whose objects are already consumed as part of the current compaction
> > cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
> > (during page allocation) and memcpy (in the kernel). When the page is
> > handed over to the kernel for remapping, there is nothing else needed
> > to be done. Furthermore, since the page is being reused, it doesn’t
> > have to be freed either.
> > B) Implement a coarse-grained page-level compaction algorithm wherein
> > pages containing live objects are slid next to each other without
> > touching them, while reclaiming in-between pages which contain only
> > garbage. Such an algorithm is very useful for compacting objects which
> > are seldom accessed by application and hence are likely to be swapped
> > out. Without this feature, this would require copying the pages
> > containing live objects, for which the src pages have to be
> > swapped-in, only to be soon swapped-out afterwards.
> >
> > AFAIK, none of the above features can be implemented using mremap
> > (with current flags), irrespective of whether the heap is a shmem or
> > private anonymous mapping, because:
> > 1) When moving a page it’s likely that its index will need to change
> > and mremapping such a page would result in VMA splitting.
> > 2) Using mremap for moving pages would result in the heap’s range
> > being covered by several vmas. The mremap in the next compaction cycle
> > (required prior to starting compaction as described above), will fail
> > with EFAULT. This is because the src range in mremap is not allowed to
> > span multiple vmas. On the other hand, calling it for each src vma is
> > not feasible because:
> >   a) It’s not trivial to identify various vmas covering the heap range
> > in userspace, and
> >   b) This operation is supposed to happen with application threads
> > paused. Invoking numerous mremap syscalls in a pause risks causing
> > janks.
> > 3) Mremap has scalability concerns due to the need to acquire mmap_sem
> > exclusively for splitting/merging VMAs. This would impact parallelism
> > of application threads, particularly during the beginning of the
> > compaction process when they are expected to cause a spurt of
> > userfaults.
> >
> >
> > IV) PROPOSAL:
> > Initially, maybe the feature can be implemented only for private
> > anonymous mappings. There are two ways this can be implemented:
> > A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
> > ‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
> > entries from the src vma, and move them to dst vma while updating
> > their ‘mapping’ and ‘index’ fields, if required.
> >
> > B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
> > to the MOVE ioctl above.
> >
> > Assuming (A) is implemented, here is broadly how the compaction would work:
> > * For a MISSING userfault in the ‘densely populated’ regions, update
> > pointers in-place in the secondary mapping page corresponding to the
> > fault address (on the heap) and then use the MOVE ioctl to map it on
> > the heap. In this case the ‘index’ field would remain the same.
> > * For a MISSING userfault in ‘fragmented’ regions, pick any freed page
> > in the secondary map, compact the objects corresponding to the fault
> > address in this page and then use MOVE ioctl to map it on the fault
> > address in the heap. This would require updating the ‘index’ field.
> > After compaction is completed, use madvise(MADV_DONTNEED) on the
> > secondary mapping to free any remaining pages.
> >
> >
> > Thanks,
> > Lokesh
> >
>
> --
> Peter Xu
>



* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-10  7:41   ` Lokesh Gidra
@ 2023-04-11 15:14     ` Peter Xu
  2023-05-08 22:56       ` Lokesh Gidra
From: Peter Xu @ 2023-04-11 15:14 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Axel Rasmussen, Andrew Morton, open list:MEMORY MANAGEMENT,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Kirill A. Shutemov, Brian Geffon, Suren Baghdasaryan,
	Kalesh Singh, Nicolas Geoffray, Jared Duke, android-mm,
	Blake Caldwell, Mike Rapoport

On Mon, Apr 10, 2023 at 12:41:31AM -0700, Lokesh Gidra wrote:
> On Thu, Apr 6, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Lokesh,
> >
> > Sorry for a late reply.  Copy Blake Caldwell and Mike too.
> 
> Thanks for the reply. It's extremely helpful.
> >
> > On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> > > I) SUMMARY:
> > > Requesting comments on a new feature which remaps pages from one
> > > private anonymous mapping to another, without altering the vmas
> > > involved. Two alternatives exist but both have drawbacks:
> > > 1. userfaultfd ioctls allocate new pages, copy data and free the old
> > > ones even when updates could be done in-place;
> > > 2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.
> >
> > Personally it was always a mystery to me how vm_pgoff works with
> > anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
> >
> > Just now I tried to apply below oneliner change:
> >
> > @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >                         /*
> >                          * Set pgoff according to addr for anon_vma.
> >                          */
> > -                       pgoff = addr >> PAGE_SHIFT;
> > +                       pgoff = 0;
> >                         break;
> >                 default:
> >                         return -EINVAL;
> >
> > The kernel even boots without a major problem so far..
> >
> > I have a feeling that I'm missing something else here; it'll be great if
> > anyone knows.
> >
> > Anyway, I agree mremap() is definitely not the best way to do page level
> > operations like this, no matter whether vm_pgoff can match or not.
> >
> > >
> > > Proposing a new mremap flag or userfaultfd ioctl which enables
> > > remapping pages without these drawbacks. Such a feature, as described
> > > below, would be very helpful in efficient implementation of concurrent
> > > compaction algorithms.
> >
> > After I read the proposal, I had a feeling that you're not aware that we
> > have similar proposals adding UFFDIO_REMAP.
> 
> Yes, I wasn't aware of this. Thanks a lot for sharing the details.
> >
> > I think it started with Andrea's initial proposal on the whole uffd:
> >
> > https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> >
> > Then for some reason it wasn't merged in the initial version, but it has
> > at least been proposed again here (even though the goal seems slightly
> > different; that one wants to move pages out instead of moving them in):
> >
> > https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/
> 
> Yeah, this seems to be the opposite of what I'm looking for. IIUC, a
> page-out REMAP can't satisfy any MISSING userfault. In fact, it enables
> MISSING faults in the future. Maybe a flag can be added to the
> uffdio_remap struct to accommodate this case, if it is still being
> pursued.

Yes, I don't think that's a major problem if the use cases mostly share
the same foundation.

> >
> > Also worth checking the latest commit that Andrea maintains himself (I
> > doubt there are major changes, but still, just to make it complete):
> >
> > https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92
> >
> > So far I think that's what you're looking for.  I'm not sure whether the
> > limitations mentioned in the old proposals of UFFDIO_REMAP will be a
> > problem, though.  For example, it required not only anonymous memory but
> > also mapcount==1 on all src pages.  But maybe that's not a problem here
> > either.
> 
> Yes, this is exactly what I am looking for. The mapcount==1 is not a
> problem either. Any idea why the patch isn't merged?

The initial version of the discussion mentioned lack of use cases as one
of the reasons:

https://lore.kernel.org/linux-mm/20150305185112.GL4280@redhat.com/

But I am not sure of the latter one.  Maybe Mike will know.

> 
> >
> > >
> > > II) MOTIVATION:
> > > Garbage collectors (like the ones used in managed languages) perform
> > > defragmentation of the managed heap by moving objects (of varying
> > > sizes) within the heap. Usually these algorithms have to be concurrent
> > > to avoid response time concerns. These are concurrent in the sense
> > > that while the GC threads are compacting the heap, application threads
> > > continue to make progress, which means enabling access to the heap
> > > while objects are being simultaneously moved.
> > >
> > > Given the high overhead of heap compaction, such algorithms typically
> > > segregate the heap into two types of regions (set of contiguous
> > > pages): those that have enough fragmentation to compact, and those
> > > that are densely populated. While only ‘fragmented’ regions are
> > > compacted by sliding objects, both types of regions are traversed to
> > > update references in them to the moved objects.
> > >
> > > A) PROT_NONE+SIGSEGV approach:
> > > One of the widely used techniques to ensure data integrity during
> > > concurrent compaction is to use page-level access interception.
> > > Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
> > > before starting compaction and installing a SIGSEGV handler. When GC
> > > threads are compacting the heap, if some application threads fault on
> > > the heap, then they compact the faulted page in the SIGSEGV handler
> > > and then enable access to it before returning. To do this atomically,
> > > the heap must use shmem (MAP_SHARED) so that an alias mapping (with
> > > read-write permission) can be used for moving objects into and
> > > updating references.
> > >
> > > Limitation: due to different access rights, the heap can end up with
> > > one vma per page in the worst case, hitting the ‘max_map_count’ limit.
> > >
> > > B) Userfaultfd approach:
> > > Userfaultfd avoids the vma split issue by intercepting page-faults
> > > when the page is missing and giving control to user-space to map the
> > > desired content. It doesn’t affect the vma properties. The compaction
> > > algorithm in this case works by first remapping the heap pages (using
> > > mremap) to a secondary mapping and then registering the heap with
> > > userfaultfd for MISSING faults. When an application thread accesses a
> > > page that has not yet been mapped (by other GC/application threads), a
> > > userfault occurs, and as a consequence the corresponding page is
> > > generated and mapped using one of the following two ioctls.
> > > 1) COPY ioctl: Typically the heap would be private anonymous in this
> > > case. For every page on the heap, compact the objects into a
> > > page-sized buffer, which COPY ioctl takes as input. The ioctl
> > > allocates a new page, copies the input buffer to it, and then maps it.
> > > This means that even for updating references in the densely populated
> > > regions (where compaction is not done), in-place updates are
> > > impossible. This results in unnecessary page clearing, memcpy, and
> > > freeing.
> > > 2) CONTINUE ioctl: the two mappings (heap and secondary) are
> > > MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
> > > regions are MISSING, in which case objects are compacted into the
> > > corresponding secondary mapping page (which triggers a regular page
> > > fault to get a page mapped) and then CONTINUE ioctl is invoked, which
> > > maps the same page on the heap mapping. On the other hand, userfaults
> > > in the ‘densely populated’ regions are MINOR (as the page already
> > > exists in the secondary mapping), in which case we update the
> > > references in the already existing page on the secondary mapping and
> > > then invoke CONTINUE ioctl.
> > >
> > > Limitation: we observed in our implementation that
> > > page-faults/page-allocation, memcpy, and madvise took (with either of
> > > the two ioctls) ~50% of the time spent in compaction.
> >
> > I assume "page-faults" applies to CONTINUE, while "page-allocation" applies
> > to COPY here.  UFFDIO_REMAP can definitely avoid memcpy, but I don't know
> > how much it'll remove in total, e.g., I don't think page faults can be
> > avoided anyway?  Also, madvise(), depending on what it is.  If it's only
> > MADV_DONTNEED, maybe it'll be helpful too so the library can reuse wasted
> > pages directly hence reducing DONTNEEDs.
> >
> That's right. page-faults -> CONTINUE and page-allocation -> COPY. The
> GC algorithm I'm describing here is mostly page-fault free as the heap
> pages are recycled.
> 
> Basically, the heap is mremapped to a secondary mapping so that we can
> start receiving MISSING faults on the heap after userfaultfd
> registration. Consequently, on every MISSING userfault, the pages from
> the secondary mapping are prepared in-place before acting as 'src' for
> the UFFDIO_REMAP ioctl call.
> 
> Also, as you said, MADV_DONTNEED will be mostly eliminated as most of
> the pages are recycled in userspace.
> 
> There are other things too that UFFDIO_REMAP enables us to do. It
> allows coarse-grained page-by-page compaction of the heap without
> swapping in the pages. This isn't possible today.
> 
> > > III) USE CASE (of the proposed feature):
> > > The proposed feature of moving pages from one vma to another will
> > > enable us to:
> > > A) Recycle pages entirely in the userspace as they are freed (pages
> > > whose objects are already consumed as part of the current compaction
> > > cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
> > > (during page allocation) and memcpy (in the kernel). When the page is
> > > handed over to the kernel for remapping, there is nothing else needed
> > > to be done. Furthermore, since the page is being reused, it doesn’t
> > > have to be freed either.
> > > B) Implement a coarse-grained page-level compaction algorithm wherein
> > > pages containing live objects are slid next to each other without
> > > touching them, while reclaiming in-between pages which contain only
> > > garbage. Such an algorithm is very useful for compacting objects which
> > > are seldom accessed by application and hence are likely to be swapped
> > > out. Without this feature, this would require copying the pages
> > > containing live objects, for which the src pages have to be
> > > swapped-in, only to be soon swapped-out afterwards.
> > >
> > > AFAIK, none of the above features can be implemented using mremap
> > > (with current flags), irrespective of whether the heap is a shmem or
> > > private anonymous mapping, because:
> > > 1) When moving a page it’s likely that its index will need to change
> > > and mremapping such a page would result in VMA splitting.
> > > 2) Using mremap for moving pages would result in the heap’s range
> > > being covered by several vmas. The mremap in the next compaction cycle
> > > (required prior to starting compaction as described above), will fail
> > > with EFAULT. This is because the src range in mremap is not allowed to
> > > span multiple vmas. On the other hand, calling it for each src vma is
> > > not feasible because:
> > >   a) It’s not trivial to identify various vmas covering the heap range
> > > in userspace, and
> > >   b) This operation is supposed to happen with application threads
> > > paused. Invoking numerous mremap syscalls in a pause risks causing
> > > janks.
> > > 3) Mremap has scalability concerns due to the need to acquire mmap_sem
> > > exclusively for splitting/merging VMAs. This would impact parallelism
> > > of application threads, particularly during the beginning of the
> > > compaction process when they are expected to cause a spurt of
> > > userfaults.
> > >
> > >
> > > IV) PROPOSAL:
> > > Initially, maybe the feature can be implemented only for private
> > > anonymous mappings. There are two ways this can be implemented:
> > > A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
> > > ‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
> > > entries from the src vma, and move them to dst vma while updating
> > > their ‘mapping’ and ‘index’ fields, if required.
> > >
> > > B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
> > > to the MOVE ioctl above.
> > >
> > > Assuming (A) is implemented, here is broadly how the compaction would work:
> > > * For a MISSING userfault in the ‘densely populated’ regions, update
> > > pointers in-place in the secondary mapping page corresponding to the
> > > fault address (on the heap) and then use the MOVE ioctl to map it on
> > > the heap. In this case the ‘index’ field would remain the same.
> > > * For a MISSING userfault in ‘fragmented’ regions, pick any freed page
> > > in the secondary map, compact the objects corresponding to the fault
> > > address in this page and then use MOVE ioctl to map it on the fault
> > > address in the heap. This would require updating the ‘index’ field.
> > > After compaction is completed, use madvise(MADV_DONTNEED) on the
> > > secondary mapping to free any remaining pages.
> > >
> > >
> > > Thanks,
> > > Lokesh
> > >
> >
> > --
> > Peter Xu

Thanks,

-- 
Peter Xu




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-06 17:29 ` Peter Xu
  2023-04-10  7:41   ` Lokesh Gidra
@ 2023-04-12  8:47   ` David Hildenbrand
  2023-04-12 15:58     ` Peter Xu
From: David Hildenbrand @ 2023-04-12  8:47 UTC (permalink / raw)
  To: Peter Xu, Lokesh Gidra
  Cc: Axel Rasmussen, Andrew Morton, open list:MEMORY MANAGEMENT,
	linux-kernel, Andrea Arcangeli, Kirill A . Shutemov,
	Kirill A. Shutemov, Brian Geffon, Suren Baghdasaryan,
	Kalesh Singh, Nicolas Geoffray, Jared Duke, android-mm,
	Blake Caldwell, Mike Rapoport

On 06.04.23 19:29, Peter Xu wrote:
> Hi, Lokesh,
> 
> Sorry for a late reply.  Copy Blake Caldwell and Mike too.
> 
> On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
>> I) SUMMARY:
>> Requesting comments on a new feature which remaps pages from one
>> private anonymous mapping to another, without altering the vmas
>> involved. Two alternatives exist but both have drawbacks:
>> 1. userfaultfd ioctls allocate new pages, copy data and free the old
>> ones even when updates could be done in-place;
>> 2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.
> 
> Personally it was always a mystery to me how vm_pgoff works with
> anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
> 
> Just now I tried to apply below oneliner change:
> 
> @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>                          /*
>                           * Set pgoff according to addr for anon_vma.
>                           */
> -                       pgoff = addr >> PAGE_SHIFT;
> +                       pgoff = 0;
>                          break;
>                  default:
>                          return -EINVAL;
> 
> The kernel even boots without a major problem so far..

I think it's for RMAP purposes.

Take a look at linear_page_index() and how it's, for example, used in 
ksm_might_need_to_copy() alongside page->index.
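
For reference, modulo the hugetlb special case, linear_page_index()
boils down to this (simplified from include/linux/pagemap.h):

	static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
						unsigned long address)
	{
		pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;

		pgoff += vma->vm_pgoff;
		return pgoff;
	}

With vm_pgoff = vm_start >> PAGE_SHIFT, that is simply
address >> PAGE_SHIFT.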

-- 
Thanks,

David / dhildenb




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-12  8:47   ` David Hildenbrand
@ 2023-04-12 15:58     ` Peter Xu
  2023-04-13  8:10       ` David Hildenbrand
From: Peter Xu @ 2023-04-12 15:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lokesh Gidra, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov, Brian Geffon,
	Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Wed, Apr 12, 2023 at 10:47:52AM +0200, David Hildenbrand wrote:
> > Personally it was always a mystery to me how vm_pgoff works with
> > anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
> > 
> > Just now I tried to apply below oneliner change:
> > 
> > @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >                          /*
> >                           * Set pgoff according to addr for anon_vma.
> >                           */
> > -                       pgoff = addr >> PAGE_SHIFT;
> > +                       pgoff = 0;
> >                          break;
> >                  default:
> >                          return -EINVAL;
> > 
> > The kernel even boots without a major problem so far..
> 
> I think it's for RMAP purposes.
> 
> Take a look at linear_page_index() and how it's, for example, used in
> ksm_might_need_to_copy() alongside page->index.

From what I read, the vma's vm_pgoff is set before setting up any
page->index within the vma, while the latter is calculated from the vma
pgoff with linear_page_index() (in __page_set_anon_rmap()).

	folio->index = linear_page_index(vma, address);

I think I missed something, but it seems to me any comparisons between
page->index and linear_page_index() will just keep working for anonymous
even if we change the vma pgoff to 0 when the vma is mapped.

Do you perhaps mean this is needed for ksm only?  I really am not familiar
enough with ksm, especially when it's swapped out.  I do see that
ksm_might_need_to_copy() wants to avoid reusing a page if the anon_vma is
not set up for the current vma, but I don't know when that happens.

	if (PageKsm(page)) {
		if (page_stable_node(page) &&
		    !(ksm_run & KSM_RUN_UNMERGE))
			return page;	/* no need to copy it */
	} else if (!anon_vma) {
		return page;		/* no need to copy it */
	} else if (page->index == linear_page_index(vma, address) &&
			anon_vma->root == vma->anon_vma->root) {
		return page;		/* still no need to copy it */
	}

I think when none of these paths triggers (aka, we need to copy) it means
there's an anon_vma assigned to the page but not the right one (even though
I don't know how that could happen..).  Meanwhile I don't see how the vma
pgoff affects this either (and I assume a real KSM page ignores page->index
completely).

Thanks,

-- 
Peter Xu




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-12 15:58     ` Peter Xu
@ 2023-04-13  8:10       ` David Hildenbrand
  2023-04-13 15:36         ` Peter Xu
  2023-06-07 20:17         ` Lorenzo Stoakes
From: David Hildenbrand @ 2023-04-13  8:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Lokesh Gidra, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov, Brian Geffon,
	Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On 12.04.23 17:58, Peter Xu wrote:
> On Wed, Apr 12, 2023 at 10:47:52AM +0200, David Hildenbrand wrote:
>>> Personally it was always a mystery to me how vm_pgoff works with
>>> anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
>>>
>>> Just now I tried to apply below oneliner change:
>>>
>>> @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>>>                           /*
>>>                            * Set pgoff according to addr for anon_vma.
>>>                            */
>>> -                       pgoff = addr >> PAGE_SHIFT;
>>> +                       pgoff = 0;
>>>                           break;
>>>                   default:
>>>                           return -EINVAL;
>>>
>>> The kernel even boots without a major problem so far..
>>
>> I think it's for RMAP purposes.
>>
>> Take a look at linear_page_index() and how it's, for example, used in
>> ksm_might_need_to_copy() alongside page->index.
> 
> From what I read, the vma's vm_pgoff is set before setting up any
> page->index within the vma, while the latter is calculated from the vma
> pgoff with linear_page_index() (in __page_set_anon_rmap()).
> 
> 	folio->index = linear_page_index(vma, address);
> 
> I think I missed something, but it seems to me any comparisons between
> page->index and linear_page_index() will just keep working for anonymous
> even if we change the vma pgoff to 0 when the vma is mapped.
> 
> Do you perhaps mean this is needed for ksm only?  I really am not familiar
> enough with ksm, especially when it's swapped out.  I do see that
> ksm_might_need_to_copy() wants to avoid reusing a page if the anon_vma is
> not set up for the current vma, but I don't know when that happens.
> 
> 	if (PageKsm(page)) {
> 		if (page_stable_node(page) &&
> 		    !(ksm_run & KSM_RUN_UNMERGE))
> 			return page;	/* no need to copy it */
> 	} else if (!anon_vma) {
> 		return page;		/* no need to copy it */
> 	} else if (page->index == linear_page_index(vma, address) &&
> 			anon_vma->root == vma->anon_vma->root) {
> 		return page;		/* still no need to copy it */
> 	}
> 
> I think when none of these paths triggers (aka, we need to copy) it means
> there's an anon_vma assigned to the page but not the right one (even though
> I don't know how that could happen..).  Meanwhile I don't see how the vma
> pgoff affects this either (and I assume a real KSM page ignores page->index
> completely).

I think you are right about folio->index = linear_page_index(vma, address).

I did not check the code yet, but thinking about it I figured out why we
want to set pgoff to the start of the VMA in the address space for
anonymous memory:

For RMAP and friends (relying on linear_page_index()), folio->index has
to match the index within the VMA. If we set pgoff to something else,
we'd have fewer VMA merging opportunities. So your system might work,
but you'd end up with many anon VMAs.


Imagine the following:

[ anon0 ][  fd   ][ anon1 ]

Unmap the fd:

[ anon0 ][ hole  ][ anon1 ]

Mmap anon:

[ anon0 ][ anon2 ][ anon1 ]


We can now merge all 3 VMAs into one, even if the first and latter 
already map pages.


A simpler and more common example is probably:

[ anon0 ]

Mmap anon1 before the existing one

[ anon1 ][ anon0 ]

Which we can merge into a single one.



Mapping after an existing one could work, but one would have to
carefully set pgoff based on the size of the previous anon VMA ... which
is more complicated.

So instead, we consider the whole address space as a virtual, anon file, 
starting at offset 0. The pgoff of a VMA is then simply the offset in 
that virtual file (easily computed from the start of the VMA), and VMA 
merging is just the same as for an ordinary file.
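
As a quick worked example of the merge check under that convention
(simplified; the real check also compares flags, anon_vmas, etc.), two
adjacent VMAs are pgoff-compatible when:

	prev->vm_pgoff + ((prev->vm_end - prev->vm_start) >> PAGE_SHIFT)
		== next->vm_pgoff

Say prev covers [0x1000, 0x3000) and next covers [0x3000, 0x5000), with
4k pages. With pgoff = addr >> PAGE_SHIFT that's 1 + 2 == 3, so they can
merge; with pgoff = 0 for both it's 0 + 2 != 0, so adjacent anon VMAs
would never be pgoff-compatible.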

-- 
Thanks,

David / dhildenb




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-13  8:10       ` David Hildenbrand
@ 2023-04-13 15:36         ` Peter Xu
  2023-06-06 20:15           ` Vlastimil Babka
  2023-06-07 20:17         ` Lorenzo Stoakes
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Xu @ 2023-04-13 15:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lokesh Gidra, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov, Brian Geffon,
	Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> So instead, we consider the whole address space as a virtual, anon file,
> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> virtual file (easily computed from the start of the VMA), and VMA merging is
> just the same as for an ordinary file.

Interesting point, thanks!

-- 
Peter Xu




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-11 15:14     ` Peter Xu
@ 2023-05-08 22:56       ` Lokesh Gidra
  2023-05-16 16:43         ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Lokesh Gidra @ 2023-05-08 22:56 UTC (permalink / raw)
  To: Peter Xu, Andrea Arcangeli
  Cc: Axel Rasmussen, Andrew Morton, open list:MEMORY MANAGEMENT,
	linux-kernel, Kirill A . Shutemov, Kirill A. Shutemov,
	Brian Geffon, Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray,
	Jared Duke, android-mm, Blake Caldwell, Mike Rapoport

On Tue, Apr 11, 2023 at 8:14 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 10, 2023 at 12:41:31AM -0700, Lokesh Gidra wrote:
> > On Thu, Apr 6, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Hi, Lokesh,
> > >
> > > Sorry for a late reply.  Copy Blake Caldwell and Mike too.
> >
> > Thanks for the reply. It's extremely helpful.
> > >
> > > On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> > > > I) SUMMARY:
> > > > Requesting comments on a new feature which remaps pages from one
> > > > private anonymous mapping to another, without altering the vmas
> > > > involved. Two alternatives exist but both have drawbacks:
> > > > 1. userfaultfd ioctls allocate new pages, copy data and free the old
> > > > ones even when updates could be done in-place;
> > > > 2. mremap results in vma splitting in most of the cases due to 'pgoff' mismatch.
> > >
> > > Personally it was always a mystery to me how vm_pgoff works with
> > > anonymous vmas and why it needs to be setup with vm_start >> PAGE_SHIFT.
> > >
> > > Just now I tried to apply below oneliner change:
> > >
> > > @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > >                         /*
> > >                          * Set pgoff according to addr for anon_vma.
> > >                          */
> > > -                       pgoff = addr >> PAGE_SHIFT;
> > > +                       pgoff = 0;
> > >                         break;
> > >                 default:
> > >                         return -EINVAL;
> > >
> > > The kernel even boots without a major problem so far..
> > >
> > > I had a feeling that I'm missing something else here; it'll be great if anyone
> > > knows.
> > >
> > > Anyway, I agree mremap() is definitely not the best way to do page level
> > > operations like this, no matter whether vm_pgoff can match or not.
> > >
> > > >
> > > > Proposing a new mremap flag or userfaultfd ioctl which enables
> > > > remapping pages without these drawbacks. Such a feature, as described
> > > > below, would be very helpful in efficient implementation of concurrent
> > > > compaction algorithms.
> > >
> > > After I read the proposal, I had a feeling that you're not aware that we
> > > have similar proposals adding UFFDIO_REMAP.
> >
> > Yes, I wasn't aware of this. Thanks a lot for sharing the details.
> > >
> > > I think it started with Andrea's initial proposal on the whole uffd:
> > >
> > > https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> > >
> > > Then for some reason it wasn't merged in the initial version, but at least it's
> > > been proposed again here (even though it seems the goal is slightly
> > > different; that may want to move page out instead of moving in):
> > >
> > > https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/
> >
> > Yeah, this seems to be the opposite of what I'm looking for. IIUC,
> > page out REMAP can't
> > satisfy any MISSING userfault. In fact, it enables MISSING faults in
> > future. Maybe a flag
> > can be added to uffdio_remap struct to accommodate this case, if it is
> > still being pursued.
>
> Yes, I don't think that's a major problem if the use cases share mostly the
> same foundation.
>
> > >
> > > Also worth checking with the latest commit that Andrea maintains himself (I
> > > doubt whether there's major changes, but still just to make it complete):
> > >
> > > https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92
> > >
> > > So far I think that's what you're looking for. I'm not sure whether the
> > > limitations will be a problem, though, at least mentioned in the old
> > > proposals of UFFDIO_REMAP.  For example, it required not only anonymous but
> > > also mapcount==1 on all src pages.  But maybe that's not a problem here
> > > too.
> >
> > Yes, this is exactly what I am looking for. The mapcount==1 is not a
> > problem either. Any idea why the patch isn't merged?
>
> The initial version of the discussion mentioned lack of use cases as one
> of the reasons:
>
> https://lore.kernel.org/linux-mm/20150305185112.GL4280@redhat.com/
>
Thanks for sharing the link. I assume the 20% performance gap between
UFFDIO_COPY and UFFDIO_REMAP is just for the ioctl calls. But (at least)
in the case of compaction (our use case), COPY increases other overheads.
It leads to more page allocations, mem-copies, and madvises than
required. OTOH, with REMAP:

1) Page allocations can be mostly avoided by recycling the pages as
they are freed during compaction
2) Memcpy (for compacting objects) into the page (from (1)) is needed
only once (as compared to COPY wherein it does another memcpy).
Furthermore, as described in the RFC, sometimes even 1 memcpy isn't
required (with REMAP)
3) As pages are being recycled in userspace, there would be far fewer
pages to madvise at the end of compaction.

Also, as described in the RFC, REMAP allows moving pages within the heap
for page-level coarse-grained compaction, which helps by avoiding
swapping in the page. This wouldn't be possible with COPY.
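
To make the flow concrete, below is a minimal sketch of how a fault
handler could resolve a MISSING fault with a REMAP-style ioctl. Note
that UFFDIO_REMAP and struct uffdio_remap here follow Andrea's
out-of-tree proposal and are assumptions on my side, not a merged
kernel ABI:

	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	/* hypothetical layout, mirroring struct uffdio_copy */
	struct uffdio_remap {
		__u64 dst;	/* faulting page in the heap mapping */
		__u64 src;	/* page prepared in the secondary mapping */
		__u64 len;
		__u64 mode;
		__s64 remap;	/* bytes remapped, or negative error */
	};

	/*
	 * On a MISSING fault: objects were already compacted in place
	 * into 'src_page', so move that page under the fault address --
	 * no allocation, no extra memcpy, and nothing to free afterwards.
	 */
	static int resolve_missing_fault(int uffd, unsigned long fault_addr,
					 unsigned long src_page,
					 unsigned long page_size)
	{
		struct uffdio_remap remap = {
			.dst = fault_addr & ~(page_size - 1),
			.src = src_page,
			.len = page_size,
		};
		/* UFFDIO_REMAP is defined by the out-of-tree patch */
		return ioctl(uffd, UFFDIO_REMAP, &remap);
	}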

> But I am not sure of the latter one.  Maybe Mike will know.
>
> >
> > >
> > > >
> > > > II) MOTIVATION:
> > > > Garbage collectors (like the ones used in managed languages) perform
> > > > defragmentation of the managed heap by moving objects (of varying
> > > > sizes) within the heap. Usually these algorithms have to be concurrent
> > > > to avoid response time concerns. These are concurrent in the sense
> > > > that while the GC threads are compacting the heap, application threads
> > > > continue to make progress, which means enabling access to the heap
> > > > while objects are being simultaneously moved.
> > > >
> > > > Given the high overhead of heap compaction, such algorithms typically
> > > > segregate the heap into two types of regions (set of contiguous
> > > > pages): those that have enough fragmentation to compact, and those
> > > > that are densely populated. While only ‘fragmented’ regions are
> > > > compacted by sliding objects, both types of regions are traversed to
> > > > update references in them to the moved objects.
> > > >
> > > > A) PROT_NONE+SIGSEGV approach:
> > > > One of the widely used techniques to ensure data integrity during
> > > > concurrent compaction is to use page-level access interception.
> > > > Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
> > > > before starting compaction and installing a SIGSEGV handler. When GC
> > > > threads are compacting the heap, if some application threads fault on
> > > > the heap, then they compact the faulted page in the SIGSEGV handler
> > > > and then enable access to it before returning. To do this atomically,
> > > > the heap must use shmem (MAP_SHARED) so that an alias mapping (with
> > > > read-write permission) can be used for moving objects into and
> > > > updating references.
> > > >
> > > > Limitation: due to different access rights, the heap can end up with
> > > > one vma per page in the worst case, hitting the ‘max_map_count’ limit.
> > > >
> > > > B) Userfaultfd approach:
> > > > Userfaultfd avoids the vma split issue by intercepting page-faults
> > > > when the page is missing and gives control to user-space to map the
> > > > desired content. It doesn’t affect the vma properties. The compaction
> > > > algorithm in this case works by first remapping the heap pages (using
> > > > mremap) to a secondary mapping and then registering the heap with
> > > > userfaultfd for MISSING faults. When an application thread accesses a
> > > > page that has not yet been mapped (by other GC/application threads), a
> > > > userfault occurs, and as a consequence the corresponding page is
> > > > generated and mapped using one of the following two ioctls.
> > > > 1) COPY ioctl: Typically the heap would be private anonymous in this
> > > > case. For every page on the heap, compact the objects into a
> > > > page-sized buffer, which COPY ioctl takes as input. The ioctl
> > > > allocates a new page, copies the input buffer to it, and then maps it.
> > > > This means that even for updating references in the densely populated
> > > > regions (where compaction is not done), in-place updating is
> > > > impossible. This results in unnecessary page-clear, memcpy and
> > > > freeing.
> > > > 2) CONTINUE ioctl: the two mappings (heap and secondary) are
> > > > MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
> > > > regions are MISSING, in which case objects are compacted into the
> > > > corresponding secondary mapping page (which triggers a regular page
> > > > fault to get a page mapped) and then CONTINUE ioctl is invoked, which
> > > > maps the same page on the heap mapping. On the other hand, userfaults
> > > > in the ‘densely populated’ regions are MINOR (as the page already
> > > > exists in the secondary mapping), in which case we update the
> > > > references in the already existing page on the secondary mapping and
> > > > then invoke CONTINUE ioctl.
> > > >
> > > > Limitation: we observed in our implementation that
> > > > page-faults/page-allocation, memcpy, and madvise took (with either of
> > > > the two ioctls) ~50% of the time spent in compaction.
> > >
> > > I assume "page-faults" applies to CONTINUE, while "page-allocation" applies
> > > to COPY here.  UFFDIO_REMAP can definitely avoid memcpy, but I don't know
> > > how much it'll remove in total, e.g., I don't think page faults can be
> > > avoided anyway?  Also, madvise(), depending on what it is.  If it's only
> > > MADV_DONTNEED, maybe it'll be helpful too so the library can reuse wasted
> > > pages directly hence reducing DONTNEEDs.
> > >
> > That's right. page-faults -> CONTINUE and page-allocation -> COPY. The
> > GC algorithm
> > I'm describing here is mostly page-fault free as the heap pages are recycled.
> >
> > Basically, the heap is mremapped to a secondary mapping so that we can
> > start receiving MISSING faults
> > on the heap after userfaultfd registration. Consequently, on every
> > MISSING userfault, the pages from the
> > secondary mapping are prepared in-place before acting as 'src' for
> > UFFDIO_REMAP ioctl call.
> >
> > Also, as you said, MADV_DONTNEED will be mostly eliminated as most of
> > the pages are recycled in userspace.
> >
> > There are other things too that UFFDIO_REMAP enables us to do. It
> > allows coarse-grained page-by-page compaction
> > of the heap without swapping-in the pages. This isn't possible today.
> >
> > > > III) USE CASE (of the proposed feature):
> > > > The proposed feature of moving pages from one vma to another will
> > > > enable us to:
> > > > A) Recycle pages entirely in the userspace as they are freed (pages
> > > > whose objects are already consumed as part of the current compaction
> > > > cycle) in the ‘fragmented’ regions. This way we avoid page-clearing
> > > > (during page allocation) and memcpy (in the kernel). When the page is
> > > > handed over to the kernel for remapping, there is nothing else needed
> > > > to be done. Furthermore, since the page is being reused, it doesn’t
> > > > have to be freed either.
> > > > B) Implement a coarse-grained page-level compaction algorithm wherein
> > > > pages containing live objects are slid next to each other without
> > > > touching them, while reclaiming in-between pages which contain only
> > > > garbage. Such an algorithm is very useful for compacting objects which
> > > > are seldom accessed by application and hence are likely to be swapped
> > > > out. Without this feature, this would require copying the pages
> > > > containing live objects, for which the src pages have to be
> > > > swapped-in, only to be soon swapped-out afterwards.
> > > >
> > > > AFAIK, none of the above features can be implemented using mremap
> > > > (with current flags), irrespective of whether the heap is a shmem or
> > > > private anonymous mapping, because:
> > > > 1) When moving a page it’s likely that its index will need to change
> > > > and mremapping such a page would result in VMA splitting.
> > > > 2) Using mremap for moving pages would result in the heap’s range
> > > > being covered by several vmas. The mremap in the next compaction cycle
> > > > (required prior to starting compaction as described above), will fail
> > > > with EFAULT. This is because the src range in mremap is not allowed to
> > > > span multiple vmas. On the other hand, calling it for each src vma is
> > > > not feasible because:
> > > >   a) It’s not trivial to identify various vmas covering the heap range
> > > > in userspace, and
> > > >   b) This operation is supposed to happen with application threads
> > > > paused. Invoking numerous mremap syscalls in a pause risks causing
> > > > janks.
> > > > 3) Mremap has scalability concerns due to the need to acquire mmap_sem
> > > > exclusively for splitting/merging VMAs. This would impact parallelism
> > > > of application threads, particularly during the beginning of the
> > > > compaction process when they are expected to cause a spurt of
> > > > userfaults.
> > > >
> > > >
> > > > IV) PROPOSAL:
> > > > Initially, maybe the feature can be implemented only for private
> > > > anonymous mappings. There are two ways this can be implemented:
> > > > A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
> > > > ‘COPY’ ioctl. After sanity check, the ioctl would detach the pte
> > > > entries from the src vma, and move them to dst vma while updating
> > > > their ‘mapping’ and ‘index’ fields, if required.
> > > >
> > > > B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similar
> > > > to the MOVE ioctl above.
> > > >
> > > > Assuming (A) is implemented, here is broadly how the compaction would work:
> > > > * For a MISSING userfault in the ‘densely populated’ regions, update
> > > > pointers in-place in the secondary mapping page corresponding to the
> > > > fault address (on the heap) and then use the MOVE ioctl to map it on
> > > > the heap. In this case the ‘index’ field would remain the same.
> > > > * For a MISSING userfault in ‘fragmented’ regions, pick any freed page
> > > > in the secondary map, compact the objects corresponding to the fault
> > > > address in this page and then use MOVE ioctl to map it on the fault
> > > > address in the heap. This would require updating the ‘index’ field.
> > > > After compaction is completed, use madvise(MADV_DONTNEED) on the
> > > > secondary mapping to free any remaining pages.
> > > >
> > > >
> > > > Thanks,
> > > > Lokesh
> > > >
> > >
> > > --
> > > Peter Xu
>
> Thanks,
>
> --
> Peter Xu
>



* Re: RFC for new feature to move pages from one vma to another without split
  2023-05-08 22:56       ` Lokesh Gidra
@ 2023-05-16 16:43         ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2023-05-16 16:43 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Andrea Arcangeli, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Kirill A . Shutemov,
	Kirill A. Shutemov, Brian Geffon, Suren Baghdasaryan,
	Kalesh Singh, Nicolas Geoffray, Jared Duke, android-mm,
	Blake Caldwell, Mike Rapoport

On Mon, May 08, 2023 at 03:56:50PM -0700, Lokesh Gidra wrote:
> On Tue, Apr 11, 2023 at 8:14 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Apr 10, 2023 at 12:41:31AM -0700, Lokesh Gidra wrote:
> > > On Thu, Apr 6, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Hi, Lokesh,
> > > >
> > > > Sorry for a late reply.  Copy Blake Caldwell and Mike too.
> > >
> > > Thanks for the reply. It's extremely helpful.
> > > >
> > > > On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> > > > > I) SUMMARY:
> > > > > Requesting comments on a new feature which remaps pages from one
> > > > > private anonymous mapping to another, without altering the vmas
> > > > > involved. Two alternatives exist but both have drawbacks:
> > > > > 1. userfaultfd ioctls allocate new pages, copy data and free the old
> > > > > ones even when updates could be done in-place;
> > > > > 2. mremap results in vma splitting in most of the cases due to 'pgoff' mismatch.
> > > >
> > > > Personally it was always a mystery to me how vm_pgoff works with
> > > > anonymous vmas and why it needs to be setup with vm_start >> PAGE_SHIFT.
> > > >
> > > > Just now I tried to apply below oneliner change:
> > > >
> > > > @@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >                         /*
> > > >                          * Set pgoff according to addr for anon_vma.
> > > >                          */
> > > > -                       pgoff = addr >> PAGE_SHIFT;
> > > > +                       pgoff = 0;
> > > >                         break;
> > > >                 default:
> > > >                         return -EINVAL;
> > > >
> > > > The kernel even boots without a major problem so far..
> > > >
> > > > I had a feeling that I'm missing something else here; it'll be great if anyone
> > > > knows.
> > > >
> > > > Anyway, I agree mremap() is definitely not the best way to do page level
> > > > operations like this, no matter whether vm_pgoff can match or not.
> > > >
> > > > >
> > > > > Proposing a new mremap flag or userfaultfd ioctl which enables
> > > > > remapping pages without these drawbacks. Such a feature, as described
> > > > > below, would be very helpful in efficient implementation of concurrent
> > > > > compaction algorithms.
> > > >
> > > > After I read the proposal, I had a feeling that you're not aware that we
> > > > have similar proposals adding UFFDIO_REMAP.
> > >
> > > Yes, I wasn't aware of this. Thanks a lot for sharing the details.
> > > >
> > > > I think it started with Andrea's initial proposal on the whole uffd:
> > > >
> > > > https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> > > >
> > > > Then for some reason it wasn't merged in the initial version, but at least it's
> > > > been proposed again here (even though it seems the goal is slightly
> > > > different; that may want to move page out instead of moving in):
> > > >
> > > > https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/
> > >
> > > Yeah, this seems to be the opposite of what I'm looking for. IIUC,
> > > page out REMAP can't
> > > satisfy any MISSING userfault. In fact, it enables MISSING faults in
> > > future. Maybe a flag
> > > can be added to uffdio_remap struct to accommodate this case, if it is
> > > still being pursued.
> >
> > Yes, I don't think that's a major problem if the use cases share mostly the
> > same foundation.
> >
> > > >
> > > > Also worth checking with the latest commit that Andrea maintains himself (I
> > > > doubt whether there's major changes, but still just to make it complete):
> > > >
> > > > https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92
> > > >
> > > > So far I think that's what you're looking for. I'm not sure whether the
> > > > limitations will be a problem, though, at least mentioned in the old
> > > > proposals of UFFDIO_REMAP.  For example, it required not only anonymous but
> > > > also mapcount==1 on all src pages.  But maybe that's not a problem here
> > > > too.
> > >
> > > Yes, this is exactly what I am looking for. The mapcount==1 is not a
> > > problem either. Any idea why the patch isn't merged?
> >
> > The initial version of the discussion mentioned lack of use cases as one
> > of the reasons:
> >
> > https://lore.kernel.org/linux-mm/20150305185112.GL4280@redhat.com/
> >
> Thanks for sharing the link. I assume the 20% performance gap between
> UFFDIO_COPY and UFFDIO_REMAP is just for the ioctl calls. But (at least)
> in the case of compaction (our use case), COPY increases other overheads.

Per my read:

        Yes, we already measured the UFFDIO_COPY is faster than
        UFFDIO_REMAP, the userfault latency decreases -20%.

It was the fault latency, so it can reflect more than the pure ioctl
measurements.  However, I think the point is valid that this specific
use case is not purely adding memory but also includes removals.  It
does seem a proper use case to me, at least from what I can see now.

> It leads to more page allocations, mem-copies, and madvises than
> required. OTOH, with REMAP:
> 
> 1) Page allocations can be mostly avoided by recycling the pages as
> they are freed during compaction
> 2) Memcpy (for compacting objects) into the page (from (1)) is needed
> only once (as compared to COPY wherein it does another memcpy).
> Furthermore, as described in the RFC, sometimes even 1 memcpy isn't
> required (with REMAP)
> 3) As pages are being recycled in userspace, there would be far fewer
> pages to madvise at the end of compaction.
> 
> Also, as described in the RFC, REMAP allows moving pages within the heap
> for page-level coarse-grained compaction, which helps by avoiding
> swapping in the page. This wouldn't be possible with COPY.

Please feel free to pick up the work if you think that's the right one for
you.  IMHO it'll be very helpful if you can justify how REMAP could improve
the use case in the cover letter with some real numbers.

Thanks,

-- 
Peter Xu




* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-13 15:36         ` Peter Xu
@ 2023-06-06 20:15           ` Vlastimil Babka
  2023-06-06 23:18             ` Suren Baghdasaryan
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2023-06-06 20:15 UTC (permalink / raw)
  To: Peter Xu, David Hildenbrand
  Cc: Lokesh Gidra, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov, Brian Geffon,
	Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On 4/13/23 17:36, Peter Xu wrote:
> On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
>> So instead, we consider the whole address space as a virtual, anon file,
>> starting at offset 0. The pgoff of a VMA is then simply the offset in that
>> virtual file (easily computed from the start of the VMA), and VMA merging is
>> just the same as for an ordinary file.
> 
> Interesting point, thanks!

FYI, I've advised a master's thesis exploring how to update page->index during
mremap() to keep things mergeable:

https://dspace.cuni.cz/bitstream/handle/20.500.11956/176288/120426800.pdf

I think the last RFC posting was:
https://lore.kernel.org/all/20220516125405.1675-1-matenajakub@gmail.com/

It was really tricky for the general case. Maybe it would be more feasible
for the limited case Lokesh describes, if we could be sure the pages that
are moved aren't mapped anywhere else.



* Re: RFC for new feature to move pages from one vma to another without split
  2023-06-06 20:15           ` Vlastimil Babka
@ 2023-06-06 23:18             ` Suren Baghdasaryan
  2023-06-08 10:05               ` Lokesh Gidra
  0 siblings, 1 reply; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-06-06 23:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Peter Xu, David Hildenbrand, Lokesh Gidra, Axel Rasmussen,
	Andrew Morton, open list:MEMORY MANAGEMENT, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Kirill A. Shutemov,
	Brian Geffon, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Tue, Jun 6, 2023 at 1:15 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/13/23 17:36, Peter Xu wrote:
> > On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> >> So instead, we consider the whole address space as a virtual, anon file,
> >> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> >> virtual file (easily computed from the start of the VMA), and VMA merging is
> >> just the same as for an ordinary file.
> >
> > Interesting point, thanks!
>
> FYI, I've advised a master's thesis exploring how to update page->index during
> mremap() to keep things mergeable:
>
> https://dspace.cuni.cz/bitstream/handle/20.500.11956/176288/120426800.pdf
>
> I think the last RFC posting was:
> https://lore.kernel.org/all/20220516125405.1675-1-matenajakub@gmail.com/
>
> It was really tricky for the general case. Maybe it would be more feasible
> for the limited case Lokesh describes, if we could be sure the pages that
> are moved aren't mapped anywhere else.

Lokesh asked me to pick up this work and prepare patches for
upstreaming. I'll start working on them after I finish with per-vma
lock support for swap and userfaultfd (targeting later this week).
Thanks for all the input folks!



* Re: RFC for new feature to move pages from one vma to another without split
  2023-04-13  8:10       ` David Hildenbrand
  2023-04-13 15:36         ` Peter Xu
@ 2023-06-07 20:17         ` Lorenzo Stoakes
  1 sibling, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2023-06-07 20:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Lokesh Gidra, Axel Rasmussen, Andrew Morton,
	open list:MEMORY MANAGEMENT, linux-kernel, Andrea Arcangeli,
	Kirill A . Shutemov, Kirill A. Shutemov, Brian Geffon,
	Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> For RMAP and friends (relying on linear_page_index), folio->index has to
> match the index within the VMA. If we set pgoff to something else, we'd
> have fewer VMA merging opportunities. So your system might work, but you'd
> end up with many anon VMAs.

I think the reverse situation, i.e. splitting the VMA, is the more serious
one, and without a correct index it would simply break rmap.

Consider:-

     [ VMA ]
        ^
        |
     [ avc ]
        ^
        |
   [ anon_vma ]
    ^    ^    ^
   /     |     \
page 1 page 2 page 3

If we unmap page 2, we cannot (or would rather not) update page 1 and page
3 to point to a new anon_vma and instead end up with:-

 [ VMA 1 ]  [ VMA 3 ]
     ^          ^
     |          |
  [ avc ]    [ avc ]
     ^          ^
      \        /
     [ anon_vma ]
      ^         ^
     /           \
  page 1        page 3

Now you need some means of knowing which VMA each page belongs to - we
have to use folio->index to look up which anon_vma_chain (avc) in the
anon_vma's interval tree (which is keyed on folio->index) contains its VMA
(actually this could be multiple VMAs due to forking).
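
For reference, the lookup is along these lines (a rough sketch of the
walk in mm/rmap.c's rmap_walk_anon(); locking and checks elided):

	pgoff_t pgoff = folio->index;	/* linear_page_index() at map time */

	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
				       pgoff + folio_nr_pages(folio) - 1) {
		struct vm_area_struct *vma = avc->vma;
		/* recover the folio's virtual address in this vma */
		unsigned long addr = vma->vm_start +
			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		/* ... act on 'addr' in 'vma' ... */
	}

which is exactly why folio->index must stay consistent with vm_pgoff for
the reverse map to find the right mappings.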

mremap() seems to me to be a lot of the reason we don't just put
vma->vm_start >> PAGE_SHIFT in folio->index on the fly: when a block of
memory is moved, we don't want to have to go and update all of the
underlying pages, so we just keep vm_pgoff the same as the old position
even after the move. We keep this in vm_pgoff so we know what pgoffs to
give to new pages to put in their index fields.

As a result, we obviously wouldn't want to merge an mremap'd VMA carrying
that special handling with one that didn't have it, to avoid pages that
could no longer be rmap'd back to the correct VMAs; requiring vm_pgoff to
increase linearly and monotonically across the merged range achieves this.
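
A quick worked example of that (numbers invented, assuming 4K pages so
PAGE_SHIFT == 12):

	old VMA: [0x10000, 0x20000), vm_pgoff = 0x10  (= 0x10000 >> 12)
	mremap() moves it to 0x50000:
	new VMA: [0x50000, 0x60000), vm_pgoff stays 0x10 (not 0x50)

	already-mapped folios still satisfy
		folio->index == ((addr - vm_start) >> 12) + vm_pgoff
	so rmap keeps working, but the moved VMA can no longer merge with
	an anon neighbour whose vm_pgoff == vm_start >> 12.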

Doing it this way keeps the code for the VMA manipulation logic the same
for file-backed and anon mappings so is (kind of) neat in that respect.

Oh as a point of interest there is _yet another_ thing that can go in
vm_pgoff, which is remapped kernel mappings via remap_pfn_range_notrack()
which puts PFN in there :))

(as you can imagine I've torn out my rapidly diminishing hair writing about
this stuff in the book)

>
>
> Imagine the following:
>
> [ anon0 ][  fd   ][ anon1 ]
>
> Unmap the fd:
>
> [ anon0 ][ hole  ][ anon1 ]
>
> Mmap anon:
>
> [ anon0 ][ anon2 ][ anon1 ]
>
>
> We can now merge all 3 VMAs into one, even if the first and last already
> map pages.
>
>
> A simpler and more common example is probably:
>
> [ anon0 ]
>
> Mmap anon1 before the existing one
>
> [ anon1 ][ anon0 ]
>
> Which we can merge into a single one.
>
>
>
> Mapping after an existing one could work, but one would have to carefully
> set pgoff based on the size of the previous anon VMA ... which is more
> complicated.
>
> So instead, we consider the whole address space as a virtual, anon file,
> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> virtual file (easily computed from the start of the VMA), and VMA merging is
> just the same as for an ordinary file.

This is a very good way of explaining it (though mremap complicates things
somewhat).

>
> --
> Thanks,
>
> David / dhildenb
>



* Re: RFC for new feature to move pages from one vma to another without split
  2023-06-06 23:18             ` Suren Baghdasaryan
@ 2023-06-08 10:05               ` Lokesh Gidra
  2023-09-14 15:30                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 15+ messages in thread
From: Lokesh Gidra @ 2023-06-08 10:05 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Peter Xu, David Hildenbrand, Axel Rasmussen,
	Andrew Morton, open list:MEMORY MANAGEMENT, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Kirill A. Shutemov,
	Brian Geffon, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Tue, Jun 6, 2023 at 4:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jun 6, 2023 at 1:15 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 4/13/23 17:36, Peter Xu wrote:
> > > On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> > >> So instead, we consider the whole address space as a virtual, anon file,
> > >> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> > >> virtual file (easily computed from the start of the VMA), and VMA merging is
> > >> just the same as for an ordinary file.
> > >
> > > Interesting point, thanks!
> >
> > FYI, I've advised a master's thesis exploring how to update page->index during
> > mremap() to keep things mergeable:
> >
> > https://dspace.cuni.cz/bitstream/handle/20.500.11956/176288/120426800.pdf
> >
> > I think the last RFC posting was:
> > https://lore.kernel.org/all/20220516125405.1675-1-matenajakub@gmail.com/
> >
> > It was really tricky for the general case. Maybe it would be more feasible
> > for the limited case Lokesh describes, if we could be sure the pages that
> > are moved aren't mapped anywhere else.

It's great that mremap is being improved for mergeability. However,
IIUC, it would still cause unnecessary splits and merges in the
private anonymous case. Also, mremap works with mmap_sem exclusively
held, thereby impacting scalability of concurrent mremap calls.

IMHO, Andrea's userfaultfd REMAP patch is a better alternative as it
doesn't have these downsides.

>
> Lokesh asked me to pick up this work and prepare patches for
> upstreaming. I'll start working on them after I finish with per-vma
> > lock support for swap and userfaultfd (targeting later this week).
> Thanks for all the input folks!

Thanks so much, Suren!



* Re: RFC for new feature to move pages from one vma to another without split
  2023-06-08 10:05               ` Lokesh Gidra
@ 2023-09-14 15:30                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2023-09-14 15:30 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Vlastimil Babka, Peter Xu, David Hildenbrand, Axel Rasmussen,
	Andrew Morton, open list:MEMORY MANAGEMENT, linux-kernel,
	Andrea Arcangeli, Kirill A . Shutemov, Kirill A. Shutemov,
	Brian Geffon, Kalesh Singh, Nicolas Geoffray, Jared Duke,
	android-mm, Blake Caldwell, Mike Rapoport

On Thu, Jun 8, 2023 at 3:05 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Tue, Jun 6, 2023 at 4:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jun 6, 2023 at 1:15 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 4/13/23 17:36, Peter Xu wrote:
> > > > On Thu, Apr 13, 2023 at 10:10:44AM +0200, David Hildenbrand wrote:
> > > >> So instead, we consider the whole address space as a virtual, anon file,
> > > >> starting at offset 0. The pgoff of a VMA is then simply the offset in that
> > > >> virtual file (easily computed from the start of the VMA), and VMA merging is
> > > >> just the same as for an ordinary file.
> > > >
> > > > Interesting point, thanks!
> > >
> > > FYI, I've advised a master's thesis exploring how to update page->index during
> > > mremap() to keep things mergeable:
> > >
> > > https://dspace.cuni.cz/bitstream/handle/20.500.11956/176288/120426800.pdf
> > >
> > > I think the last RFC posting was:
> > > https://lore.kernel.org/all/20220516125405.1675-1-matenajakub@gmail.com/
> > >
> > > It was really tricky for the general case. Maybe it would be more feasible
> > > for the limited case Lokesh describes, if we could be sure the pages that
> > > are moved aren't mapped anywhere else.
>
> It's great that mremap is being improved for mergeability. However,
> IIUC, it would still cause unnecessary splits and merges in the
> private anonymous case. Also, mremap works with mmap_sem exclusively
> held, thereby impacting scalability of concurrent mremap calls.
>
> IMHO, Andrea's userfaultfd REMAP patch is a better alternative as it
> doesn't have these downsides.
>
> >
> > Lokesh asked me to pick up this work and prepare patches for
> > upstreaming. I'll start working on them after I finish with per-vma
> > > lock support for swap and userfaultfd (targeting later this week).
> > Thanks for all the input folks!
>
> Thanks so much, Suren!

I just posted the patchset at
https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/.
I tried to keep it as true to Andrea's original as possible but still
had to make some sizable changes, which I described in the cover
letter. Feedback would be much appreciated!



end of thread [~2023-09-14 15:31 UTC]

Thread overview: 15+ messages
2023-02-16 22:27 RFC for new feature to move pages from one vma to another without split Lokesh Gidra
2023-04-06 17:29 ` Peter Xu
2023-04-10  7:41   ` Lokesh Gidra
2023-04-11 15:14     ` Peter Xu
2023-05-08 22:56       ` Lokesh Gidra
2023-05-16 16:43         ` Peter Xu
2023-04-12  8:47   ` David Hildenbrand
2023-04-12 15:58     ` Peter Xu
2023-04-13  8:10       ` David Hildenbrand
2023-04-13 15:36         ` Peter Xu
2023-06-06 20:15           ` Vlastimil Babka
2023-06-06 23:18             ` Suren Baghdasaryan
2023-06-08 10:05               ` Lokesh Gidra
2023-09-14 15:30                 ` Suren Baghdasaryan
2023-06-07 20:17         ` Lorenzo Stoakes
