Subject: Re: [EXT] Re: COW in userspace
To: David Hildenbrand
CC: Wolfgang Mauerer, Mario Mintel
From: Ralf Ramsauer
Date: Mon, 23 Aug 2021 12:49:08 +0200
Message-ID: <7602103f-2c6e-3c1c-db03-a8c43a8fc32d@oth-regensburg.de>
References: <8bc6b208-2b4c-03d6-c9c3-c36daf55d3f7@redhat.com>

On 23/08/2021 12:33, David Hildenbrand wrote:
> On 23.08.21 12:16, Ralf Ramsauer wrote:
>>
>>
>> On 23/08/2021 10:02, David Hildenbrand wrote:
>>> On 20.08.21 15:13, Ralf Ramsauer wrote:
>>>> Dear mm folks,
>>>>
>>>> I have an issue, where it would be great to have a COW-backed virtual
>>>> memory area within an userspace process. I know there's the possibility
>>>> to have a file-backed MAP_SHARED vma, which is later duplicated with
>>>> MAP_PRIVATE, but that's not exactly what I'm looking for.
>>>>
>>>> Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
>>>> PROT_RW. Userspace happily writes to/reads from it. At some point in
>>>> time, I want to 'snapshot' that single VMA within the context of the
>>>> process and without the need to fork(). Say there's something like
>>>>
>>>>     a = mmap(0, len, PROT_RW, MAP_ANON | MAP_POPULATE, -1, 0);
>>>>     [... fill a ...]
>>>>
>>>>     b = mmdup(a, len, PROT_READ);
>>>>
>>>> b shall be the new base pointer of a new VMA that is backed by COW
>>>> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
>>>> VMAs (a and b) will fault on subsequent writes and duplicate the
>>>> previously shared physical mapping, pretty much what cow_fault or
>>>> shared_fault does.
>>>>
>>>> Afaict, this, or at least something like this is currently not
>>>> supported by the kernel. Is that correct? If so, why? Generally
>>>> spoken, is it a bad idea?
>>>
>>> Not sure if it helps (most probably not), QEMU uses uffd-wp for
>>> background snapshots of VM memory. It's different, though, as you'll
>>> only have a single mapping and will be catching modifications to your
>>> single mapping, such that you can "safe away" relevant snapshot pages
>>> before any modifications.
>>
>> Thanks for the pointer, David. I'll have a look.
>>
>>>
>>> You mention "both VMAs (a and b) will fault on subsequent writes", so
>>> would you actually be allowing PROT_WRITE access to b ("snapshot")?
>>>
>>
>> In general, yes, both should be allowed to be PROT_WRITE. So no matter
>> "which side" causes the fault, simply both will lead to duplication.
>>
>> If it would make things easier, then it would also be absolutely fine to
>> have the snapshot PROT_READ, which would suffice my requirements as well.
>
> I recall that Redis has very similar requirements for live snapshotting.

100 points, you just managed to figure out what we're exactly working
on! ;-)

> They used to handle it via fork() just as you described as I was told. I

Right, and fork() is damn slow, especially when forking large mappings.
A simple mmap() of the same area (w/o population) is at least 4x faster.
And you avoid all the work that fork() implies but that you don't
actually need.

> don't know if they already switched to uffd-wp, but I would guess they
> already did, because they were another excellent use case for uffd-wp
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html
>
> You can handle COW manually in user space that way
>
> 1. Creating a second anonymous mapping
> 2. Registering a UFFD-WP handler on the original mapping
> 3. WP-protecting the original mapping via UFFD
> 4. Tracking in a bitmap which pages were already copied

Ok, great, thanks, I'll have a look into that one!
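
A rough sketch of steps 2 and 3 from the list above, to make them
concrete. It assumes kernel headers from Linux 5.7 or later (where uffd-wp
for anonymous memory exists), an already populated region, and that the
process is allowed to call userfaultfd(). The helper name uffd_wp_arm() is
made up for illustration. Step 1 would be a plain anonymous mmap() of the
same length, and a handler thread reading events from the returned fd is
not shown.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: register [addr, addr + len) for write-protect
 * tracking and write-protect it. Returns the userfaultfd descriptor on
 * success, -1 on any failure (old kernel, missing permission, ...).
 * addr and len are assumed to be page-aligned.
 */
int uffd_wp_arm(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    /* Handshake: ask for write-protect fault reporting. */
    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        goto fail;

    /* Step 2: register the original mapping in WP mode. */
    struct uffdio_register reg;
    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_WP;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        goto fail;

    /* Step 3: write-protect the whole range. */
    struct uffdio_writeprotect wp;
    memset(&wp, 0, sizeof(wp));
    wp.range.start = (unsigned long)addr;
    wp.range.len = len;
    wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
        goto fail;

    return uffd;

fail:
    close(uffd);
    return -1;
}
```

After a successful call, any write to the range blocks the writing thread
until a handler resolves the fault, so the handler has to be running
before the application touches the memory again.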
>
> So when you get notified about a WP event, you copy the page manually to
> the second mapping, un-protect the page, and remember in the bitmap that
> the page has been copied.
>
> When reading the snapshot, you have to take a look at the bitmap to
> figure out if you have to read a specific page from the original, or
> from the second mapping. But you won't be able to just read the second
> mapping. (question would be, if that is really required or can be
> worked-around)

Thanks a bunch!
  Ralf
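
The bookkeeping described above (copy on WP event, mark the bitmap,
consult the bitmap when reading the snapshot) can be sketched without any
kernel involvement. The version below is only a simulation on plain
buffers: all names (struct snapshot, snapshot_on_wp_event(), SNAP_PAGE,
...) are made up, the page size is assumed to be 4096 bytes, and the place
where a real handler would un-protect the page is just a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SNAP_PAGE 4096u /* assumed page size */

struct snapshot {
    uint8_t *orig;   /* live mapping, keeps being modified */
    uint8_t *copy;   /* second mapping, receives saved pages */
    uint8_t *bitmap; /* one bit per page: 1 = page saved in copy */
    size_t npages;
};

/* Take a snapshot of orig; nothing is copied yet. Returns 0 or -1. */
int snapshot_init(struct snapshot *s, uint8_t *orig, size_t npages)
{
    s->orig = orig;
    s->npages = npages;
    s->copy = malloc(npages * SNAP_PAGE);
    s->bitmap = calloc((npages + 7) / 8, 1);
    if (!s->copy || !s->bitmap) {
        free(s->copy);
        free(s->bitmap);
        return -1;
    }
    return 0;
}

static int page_saved(const struct snapshot *s, size_t page)
{
    return (s->bitmap[page / 8] >> (page % 8)) & 1;
}

/* Called where a uffd handler would receive a WP event for `page`. */
void snapshot_on_wp_event(struct snapshot *s, size_t page)
{
    assert(page < s->npages);
    if (page_saved(s, page))
        return; /* already preserved, keep the first copy */
    memcpy(s->copy + page * SNAP_PAGE, s->orig + page * SNAP_PAGE,
           SNAP_PAGE);
    s->bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
    /* A real handler would now drop the write protection for this page
     * (UFFDIO_WRITEPROTECT with mode 0) so the writer can continue. */
}

/* Snapshot view of `page`: saved copy if it exists, else the original. */
const uint8_t *snapshot_page(const struct snapshot *s, size_t page)
{
    assert(page < s->npages);
    if (page_saved(s, page))
        return s->copy + page * SNAP_PAGE;
    return s->orig + page * SNAP_PAGE;
}

void snapshot_free(struct snapshot *s)
{
    free(s->copy);
    free(s->bitmap);
}
```

This also shows why the second mapping cannot be read directly: pages that
were never written after the snapshot only exist in the original.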