From: Lokesh Gidra <lokeshgidra@google.com>
Date: Thu, 16 Feb 2023 14:27:11 -0800
Subject: RFC for new feature to move pages from one vma to another without split
To: Peter Xu, Axel Rasmussen, Andrew Morton, "open list:MEMORY MANAGEMENT",
 linux-kernel, Andrea Arcangeli, "Kirill A. Shutemov"
Cc: Brian Geffon, Suren Baghdasaryan, Kalesh Singh, Nicolas Geoffray,
 Jared Duke, android-mm

I) SUMMARY:

Requesting comments on a new feature which remaps pages from one private
anonymous mapping to another, without altering the vmas involved. Two
alternatives exist today, but both have drawbacks:
1. userfaultfd ioctls allocate new pages, copy the data and free the old
   pages even when updates could be done in-place;
2. mremap results in vma splitting in most cases due to 'pgoff' mismatch.

Proposing a new mremap flag or userfaultfd ioctl which enables remapping
pages without these drawbacks. Such a feature, as described below, would
be very helpful for efficient implementation of concurrent compaction
algorithms.

II) MOTIVATION:

Garbage collectors (like the ones used in managed languages) perform
defragmentation of the managed heap by moving objects (of varying sizes)
within the heap. These algorithms usually have to be concurrent to avoid
response-time concerns: application threads continue to make progress
while the GC threads compact the heap, which requires keeping the heap
accessible while objects are being moved.

Given the high overhead of heap compaction, such algorithms typically
segregate the heap into two types of regions (sets of contiguous pages):
those with enough fragmentation to be worth compacting, and those that
are densely populated. While only ‘fragmented’ regions are compacted by
sliding objects, both types of regions are traversed to update references
in them to the moved objects.

A) PROT_NONE+SIGSEGV approach:

One widely used technique to ensure data integrity during concurrent
compaction is page-level access interception. Traditionally, this is
implemented by mprotecting (PROT_NONE) the heap before starting
compaction and installing a SIGSEGV handler. While GC threads are
compacting the heap, an application thread that faults on the heap
compacts the faulted page in the SIGSEGV handler and then enables access
to it before returning. To do this atomically, the heap must use shmem
(MAP_SHARED) so that an alias mapping (with read-write permission) can be
used for moving objects in and updating references.

Limitation: due to the differing access rights, the heap can end up with
one vma per page in the worst case, hitting the ‘max_map_count’ limit.
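To make the above concrete, here is a minimal sketch of this approach,
assuming a memfd-backed heap; compact_page_via_alias() is an illustrative
placeholder for the GC's per-page work, and all error handling is omitted:

#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_SIZE (64UL << 20)

static char *heap;   /* app-visible mapping; PROT_NONE during compaction */
static char *alias;  /* read-write alias of the same shmem file (GC side) */
static long page_size;

/* Placeholder: compact objects into / fix up references in one page,
 * working through `alias` while `heap` is still inaccessible. */
static void compact_page_via_alias(unsigned long off) { (void)off; }

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
        unsigned long off = ((unsigned long)info->si_addr -
                             (unsigned long)heap) & ~(page_size - 1);

        compact_page_via_alias(off);
        /* Re-enabling access one page at a time is what gradually
         * splits the heap into many vmas. */
        mprotect(heap + off, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
        page_size = sysconf(_SC_PAGESIZE);

        int fd = memfd_create("heap", 0);
        ftruncate(fd, HEAP_SIZE);
        heap  = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        alias = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        struct sigaction sa = { .sa_sigaction = segv_handler,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(heap, HEAP_SIZE, PROT_NONE);  /* compaction cycle begins */
        /* ... GC compacts via `alias`; app faults land in segv_handler ... */
        return 0;
}

(Strictly speaking mprotect is not async-signal-safe, but this pattern is
common in GC runtimes.)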
B) Userfaultfd approach:

Userfaultfd avoids the vma-split issue by intercepting page faults when
the page is missing and giving control to user-space to map the desired
content. It doesn't affect the vma properties. The compaction algorithm
in this case works by first remapping the heap pages (using mremap) to a
secondary mapping and then registering the heap with userfaultfd for
MISSING faults. When an application thread accesses a page that has not
yet been mapped (by other GC/application threads), a userfault occurs,
and as a consequence the corresponding page is generated and mapped using
one of the following two ioctls:

1) COPY ioctl: typically the heap would be private anonymous in this
case. For every page on the heap, compact the objects into a page-sized
buffer, which the COPY ioctl takes as input. The ioctl allocates a new
page, copies the input buffer into it, and then maps it. This means that
even for updating references in the densely populated regions (where
compaction is not done), in-place updates are impossible, resulting in
unnecessary page-clearing, memcpy and freeing.

2) CONTINUE ioctl: the two mappings (heap and secondary) are MAP_SHARED
to the same shmem file. Userfaults in the ‘fragmented’ regions are
MISSING, in which case objects are compacted into the corresponding
secondary-mapping page (which triggers a regular page fault to get a page
mapped) and then the CONTINUE ioctl is invoked, which maps the same page
on the heap mapping. On the other hand, userfaults in the ‘densely
populated’ regions are MINOR (as the page already exists in the secondary
mapping), in which case we update the references in the already existing
page of the secondary mapping and then invoke the CONTINUE ioctl.

Limitation: we observed in our implementation that
page-faults/page-allocation, memcpy, and madvise took (with either of the
two ioctls) ~50% of the time spent in compaction.
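For reference, here is a minimal sketch of the MISSING registration and
the COPY-based fault loop (the CONTINUE variant is analogous, passing a
struct uffdio_continue for a shmem-backed heap); produce_page() is an
illustrative placeholder and all error handling is elided:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder: produce the final contents of the heap page at `dst`
 * (compacted objects, or a copy with updated references) in `buf`. */
extern void produce_page(void *buf, unsigned long dst);

static int setup_uffd(char *heap, unsigned long len)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (unsigned long)heap, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        return uffd;
}

static void handle_faults(int uffd, void *page_buf, unsigned long page_size)
{
        struct uffd_msg msg;

        while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
                if (msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;

                unsigned long dst = msg.arg.pagefault.address &
                                    ~(page_size - 1);

                produce_page(page_buf, dst);

                /* The kernel allocates a fresh page, copies `page_buf`
                 * into it and maps it at `dst`, even when an in-place
                 * update of an existing page would have sufficed. */
                struct uffdio_copy copy = {
                        .dst = dst,
                        .src = (unsigned long)page_buf,
                        .len = page_size,
                };
                ioctl(uffd, UFFDIO_COPY, &copy);
        }
}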
III) USE CASE (of the proposed feature):

The proposed feature of moving pages from one vma to another will enable
us to:

A) Recycle pages entirely in userspace as they are freed (pages whose
objects are already consumed as part of the current compaction cycle) in
the ‘fragmented’ regions. This way we avoid page-clearing (during page
allocation) and memcpy (in the kernel). When the page is handed over to
the kernel for remapping, there is nothing else that needs to be done.
Furthermore, since the page is being reused, it doesn't have to be freed
either.

B) Implement a coarse-grained page-level compaction algorithm wherein
pages containing live objects are slid next to each other without
touching them, while reclaiming the in-between pages which contain only
garbage. Such an algorithm is very useful for compacting objects which
are seldom accessed by the application and hence are likely to be swapped
out. Without this feature, this would require copying the pages
containing live objects, for which the src pages have to be swapped in,
only to be swapped out again soon afterwards.

AFAIK, none of the above features can be implemented using mremap (with
current flags), irrespective of whether the heap is a shmem or private
anonymous mapping, because:

1) When moving a page it's likely that its index will need to change, and
mremapping such a page would result in vma splitting.

2) Using mremap for moving pages would result in the heap's range being
covered by several vmas. The mremap in the next compaction cycle
(required prior to starting compaction, as described above) will fail
with EFAULT, because the src range in mremap is not allowed to span
multiple vmas. On the other hand, calling it for each src vma is not
feasible because:
  a) it's not trivial to identify the various vmas covering the heap
range in userspace, and
  b) this operation is supposed to happen with application threads
paused; invoking numerous mremap syscalls in a pause risks causing jank.

3) mremap has scalability concerns due to the need to acquire mmap_sem
exclusively for splitting/merging vmas. This would impact parallelism of
application threads, particularly at the beginning of the compaction
process, when they are expected to cause a spurt of userfaults.

IV) PROPOSAL:

Initially, the feature can perhaps be implemented only for private
anonymous mappings. There are two ways this can be done:

A) A new userfaultfd ioctl, ‘MOVE’, which takes the same inputs as the
‘COPY’ ioctl. After sanity checks, the ioctl would detach the pte entries
from the src vma and move them to the dst vma, updating their ‘mapping’
and ‘index’ fields if required.

B) A new mremap flag, ‘MREMAP_ONLYPAGES’, which works similarly to the
MOVE ioctl above.
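To illustrate proposal (A), here is a sketch of what the uAPI might look
like; the UFFDIO_MOVE name, the struct layout and the request number are
hypothetical, modelled on struct uffdio_copy since the new ioctl is meant
to take the same inputs:

#include <linux/types.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/* Hypothetical uAPI; nothing below exists in the kernel today. */
struct uffdio_move {
        __u64 dst;   /* page-aligned destination (registered heap vma) */
        __u64 src;   /* page-aligned source (e.g. secondary mapping)   */
        __u64 len;   /* multiple of the page size                      */
        __u64 mode;
        __s64 move;  /* out: bytes moved, or negated error             */
};
#define UFFDIO_MOVE _IOWR(UFFDIO, 0x05, struct uffdio_move) /* nr arbitrary */

/* Would replace the UFFDIO_COPY call in the fault loop sketched earlier:
 * the ptes backing `ready_page` are detached from the secondary mapping
 * and moved to the faulting heap address, with no allocation and no copy. */
static void resolve_with_move(int uffd, unsigned long dst,
                              void *ready_page, unsigned long page_size)
{
        struct uffdio_move mv = {
                .dst = dst,
                .src = (unsigned long)ready_page,
                .len = page_size,
        };
        ioctl(uffd, UFFDIO_MOVE, &mv);
}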
Assuming (A) is implemented, here is broadly how the compaction would
work:

* For a MISSING userfault in the ‘densely populated’ regions, update
pointers in-place in the secondary-mapping page corresponding to the
fault address (on the heap) and then use the MOVE ioctl to map it on the
heap. In this case the ‘index’ field would remain the same.

* For a MISSING userfault in the ‘fragmented’ regions, pick any freed
page in the secondary mapping, compact the objects corresponding to the
fault address into this page, and then use the MOVE ioctl to map it at
the fault address in the heap. This would require updating the ‘index’
field.

After compaction is completed, use madvise(MADV_DONTNEED) on the
secondary mapping to free any remaining pages.

Thanks,
Lokesh