Date: Thu, 6 Apr 2023 13:29:36 -0400
From: Peter Xu <peterx@redhat.com>
To: Lokesh Gidra
Cc: Axel Rasmussen, Andrew Morton, "open list:MEMORY MANAGEMENT",
 linux-kernel, Andrea Arcangeli, "Kirill A. Shutemov",
 Brian Geffon, Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray,
 Jared Duke, android-mm, Blake Caldwell, Mike Rapoport
Subject: Re: RFC for new feature to move pages from one vma to another
 without split

Hi, Lokesh,

Sorry for the late reply. Copying Blake Caldwell and Mike too.

On Thu, Feb 16, 2023 at 02:27:11PM -0800, Lokesh Gidra wrote:
> I) SUMMARY:
> Requesting comments on a new feature which remaps pages from one
> private anonymous mapping to another, without altering the vmas
> involved. Two alternatives exist but both have drawbacks:
> 1. userfaultfd ioctls allocate new pages, copy data and free the old
> ones even when updates could be done in-place;
> 2. mremap results in vma splitting in most of the cases due to
> 'pgoff' mismatch.

Personally it has always been a mystery to me how vm_pgoff works with
anonymous vmas and why it needs to be set up as vm_start >> PAGE_SHIFT.
Just now I tried to apply the one-liner change below:

@@ -1369,7 +1369,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		/*
 		 * Set pgoff according to addr for anon_vma.
 		 */
-		pgoff = addr >> PAGE_SHIFT;
+		pgoff = 0;
 		break;
 	default:
 		return -EINVAL;

The kernel even boots without major problems so far. I have a feeling
that I'm missing something else here; it'd be great if anyone knows.

Anyway, I agree mremap() is definitely not the best way to do page-level
operations like this, no matter whether vm_pgoff matches or not.
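As a side note, the splitting is easy to see from userspace. Below is a
small self-contained demo (my own illustration, not code from this
thread) that moves one page out of the middle of a private anonymous
mapping with mremap(); afterwards /proc/self/maps shows the original
region covered by two vmas around a hole, and the new vma typically
cannot merge with its neighbours because of the vm_pgoff mismatch
discussed above:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	char *heap = mmap(NULL, 8 * pgsz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (heap == MAP_FAILED || dst == MAP_FAILED)
		return 1;
	memset(heap, 0x5a, 8 * pgsz);	/* fault all pages in */

	/* Move the 4th heap page on top of "dst": the page keeps its
	 * contents, but "heap" is left split around a hole. */
	if (mremap(heap + 3 * pgsz, pgsz, pgsz,
		   MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
		return 1;

	/* A later mremap() spanning the whole old heap range would now
	 * fail with EFAULT, since the src may not cross vmas. */
	char cmd[32];
	snprintf(cmd, sizeof(cmd), "cat /proc/%d/maps", (int)getpid());
	return system(cmd) < 0;
}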
>
> Proposing a new mremap flag or userfaultfd ioctl which enables
> remapping pages without these drawbacks. Such a feature, as described
> below, would be very helpful in efficient implementation of concurrent
> compaction algorithms.

After I read the proposal, I had a feeling that you're not aware that
there have been similar proposals before, adding UFFDIO_REMAP. I think
it started with Andrea's initial proposal on the whole uffd:

https://lore.kernel.org/linux-mm/1425575884-2574-1-git-send-email-aarcange@redhat.com/

Then for some reason it was not merged in the initial version, but it
has at least been proposed again here (even though the goal seems
slightly different: that one may want to move pages out rather than
move them in):

https://lore.kernel.org/linux-mm/cover.1547251023.git.blake.caldwell@colorado.edu/

Also worth checking is the latest commit that Andrea maintains himself
(I doubt there are major changes, but just to make the list complete):

https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc19e92

So far I think that's what you're looking for. I'm not sure whether the
limitations mentioned in the old UFFDIO_REMAP proposals will be a
problem here, though. For example, it required all src pages to be not
only anonymous but also mapcount==1. But maybe that's not a problem for
your use case either.
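For reference, the interface in those proposals looked roughly like the
sketch below. This is reconstructed from the unmerged patches, so treat
the struct layout, flag names and semantics as best-effort recollections
to be verified against the links above, not as a merged kernel ABI:

#include <linux/types.h>

/* From the old (never merged) UFFDIO_REMAP proposals, approximately: */
struct uffdio_remap {
	__u64 dst;	/* start of the destination range */
	__u64 src;	/* start of the source range; ptes are detached
			 * from here and moved over to dst */
	__u64 len;
	__u64 mode;	/* e.g. UFFDIO_REMAP_MODE_DONTWAKE */
	__s64 remap;	/* output: bytes remapped, or negated errno */
};

/* Usage mirrored UFFDIO_COPY: fill the struct, then call
 * ioctl(uffd, UFFDIO_REMAP, &args), with the request number coming
 * from the patchset's uapi header. The call failed unless every src
 * page was anonymous with mapcount == 1, which is the limitation
 * mentioned above. */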
>
> II) MOTIVATION:
> Garbage collectors (like the ones used in managed languages) perform
> defragmentation of the managed heap by moving objects (of varying
> sizes) within the heap. Usually these algorithms have to be concurrent
> to avoid response time concerns. These are concurrent in the sense
> that while the GC threads are compacting the heap, application threads
> continue to make progress, which means enabling access to the heap
> while objects are being simultaneously moved.
>
> Given the high overhead of heap compaction, such algorithms typically
> segregate the heap into two types of regions (sets of contiguous
> pages): those that have enough fragmentation to compact, and those
> that are densely populated. While only ‘fragmented’ regions are
> compacted by sliding objects, both types of regions are traversed to
> update references in them to the moved objects.
>
> A) PROT_NONE+SIGSEGV approach:
> One of the widely used techniques to ensure data integrity during
> concurrent compaction is page-level access interception.
> Traditionally, this is implemented by mprotecting (PROT_NONE) the heap
> before starting compaction and installing a SIGSEGV handler. When GC
> threads are compacting the heap, if some application threads fault on
> the heap, then they compact the faulted page in the SIGSEGV handler
> and then enable access to it before returning. To do this atomically,
> the heap must use shmem (MAP_SHARED) so that an alias mapping (with
> read-write permission) can be used for moving objects into it and
> updating references.
>
> Limitation: due to different access rights, the heap can end up with
> one vma per page in the worst case, hitting the ‘max_map_count’ limit.
>
> B) Userfaultfd approach:
> Userfaultfd avoids the vma split issue by intercepting page faults
> when the page is missing and giving control to user-space to map the
> desired content. It doesn't affect the vma properties. The compaction
> algorithm in this case works by first remapping the heap pages (using
> mremap) to a secondary mapping and then registering the heap with
> userfaultfd for MISSING faults. When an application thread accesses a
> page that has not yet been mapped (by other GC/application threads), a
> userfault occurs, and as a consequence the corresponding page is
> generated and mapped using one of the following two ioctls.
> 1) COPY ioctl: Typically the heap would be private anonymous in this
> case. For every page on the heap, compact the objects into a
> page-sized buffer, which the COPY ioctl takes as input. The ioctl
> allocates a new page, copies the input buffer into it, and then maps
> it. This means that even for updating references in the densely
> populated regions (where compaction is not done), in-place updates are
> impossible. This results in unnecessary page-clearing, memcpy and
> freeing.
> 2) CONTINUE ioctl: the two mappings (heap and secondary) are
> MAP_SHARED to the same shmem file. Userfaults in the ‘fragmented’
> regions are MISSING, in which case objects are compacted into the
> corresponding secondary mapping page (which triggers a regular page
> fault to get a page mapped) and then the CONTINUE ioctl is invoked,
> which maps the same page on the heap mapping. On the other hand,
> userfaults in the ‘densely populated’ regions are MINOR (as the page
> already exists in the secondary mapping), in which case we update the
> references in the already existing page on the secondary mapping and
> then invoke the CONTINUE ioctl.
>
> Limitation: we observed in our implementation that
> page-faults/page-allocation, memcpy, and madvise took (with either of
> the two ioctls) ~50% of the time spent in compaction.

I assume "page-faults" applies to CONTINUE, while "page-allocation"
applies to COPY here. UFFDIO_REMAP can definitely avoid the memcpy, but
I don't know how much it'll save in total; e.g., I don't think the page
faults themselves can be avoided anyway. The madvise() cost depends on
which madvise it is: if it's only MADV_DONTNEED, a remap-style interface
may help there too, since the library could reuse wasted pages directly,
reducing the number of DONTNEEDs.
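To make the overhead concrete, here is a minimal sketch of the COPY path
being described, using the existing merged API (my illustration; uffd,
fault_addr and buf -- a page-sized buffer holding the compacted objects
-- are assumed to come from the surrounding fault-handling loop):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int resolve_with_copy(int uffd, void *fault_addr,
			     void *buf, long page_size)
{
	struct uffdio_copy uc = {
		.dst = (unsigned long)fault_addr,
		.src = (unsigned long)buf,
		.len = page_size,
		.mode = 0,	/* 0 == wake the faulting thread */
	};

	/* The kernel allocates a fresh page, memcpys buf into it and
	 * maps it at dst -- the page allocation and copy that an
	 * in-place MOVE/REMAP would avoid. */
	if (ioctl(uffd, UFFDIO_COPY, &uc) == -1)
		return -1;
	return 0;
}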
>
> III) USE CASE (of the proposed feature):
> The proposed feature of moving pages from one vma to another will
> enable us to:
> A) Recycle pages entirely in userspace as they are freed (pages whose
> objects are already consumed as part of the current compaction cycle)
> in the ‘fragmented’ regions. This way we avoid page-clearing (during
> page allocation) and memcpy (in the kernel). When the page is handed
> over to the kernel for remapping, there is nothing else that needs to
> be done. Furthermore, since the page is being reused, it doesn’t have
> to be freed either.
> B) Implement a coarse-grained page-level compaction algorithm wherein
> pages containing live objects are slid next to each other without
> touching them, while reclaiming the in-between pages which contain
> only garbage. Such an algorithm is very useful for compacting objects
> which are seldom accessed by the application and hence are likely to
> be swapped out. Without this feature, this would require copying the
> pages containing live objects, for which the src pages have to be
> swapped in, only to be soon swapped out afterwards.
>
> AFAIK, none of the above features can be implemented using mremap
> (with current flags), irrespective of whether the heap is a shmem or
> private anonymous mapping, because:
> 1) When moving a page it’s likely that its index will need to change,
> and mremapping such a page would result in VMA splitting.
> 2) Using mremap for moving pages would result in the heap’s range
> being covered by several vmas. The mremap in the next compaction cycle
> (required prior to starting compaction as described above) will then
> fail with EFAULT, because the src range in mremap is not allowed to
> span multiple vmas. On the other hand, calling it once per src vma is
> not feasible because:
> a) It’s not trivial to identify the various vmas covering the heap
> range in userspace, and
> b) This operation is supposed to happen with application threads
> paused. Invoking numerous mremap syscalls in a pause risks causing
> janks.
> 3) Mremap has scalability concerns due to the need to acquire mmap_sem
> exclusively for splitting/merging VMAs. This would impact parallelism
> of application threads, particularly during the beginning of the
> compaction process when they are expected to cause a spurt of
> userfaults.
>
>
> IV) PROPOSAL:
> Initially, maybe the feature can be implemented only for private
> anonymous mappings. There are two ways this can be implemented:
> A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
> ‘COPY’ ioctl. After sanity checks, the ioctl would detach the pte
> entries from the src vma and move them to the dst vma, while updating
> their ‘mapping’ and ‘index’ fields, if required.
>
> B) Add a new flag to mremap, ‘MREMAP_ONLYPAGES’, which works similarly
> to the MOVE ioctl above.
>
> Assuming (A) is implemented, here is broadly how the compaction would
> work:
> * For a MISSING userfault in the ‘densely populated’ regions, update
> pointers in-place in the secondary mapping page corresponding to the
> fault address (on the heap) and then use the MOVE ioctl to map it on
> the heap. In this case the ‘index’ field would remain the same.
> * For a MISSING userfault in ‘fragmented’ regions, pick any freed page
> in the secondary map, compact the objects corresponding to the fault
> address into this page and then use the MOVE ioctl to map it at the
> fault address in the heap. This would require updating the ‘index’
> field. After compaction is completed, use madvise(MADV_DONTNEED) on
> the secondary mapping to free any remaining pages.
>
>
> Thanks,
> Lokesh

-- 
Peter Xu