Re: [EXT] Re: COW in userspace

From: Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de>
To: David Hildenbrand <david@redhat.com>, <linux-mm@kvack.org>
Cc: Wolfgang Mauerer <wolfgang.mauerer@oth-regensburg.de>,
	Mario Mintel <mario.mintel@st.oth-regensburg.de>
Subject: Re: [EXT] Re: COW in userspace
Date: Mon, 23 Aug 2021 12:49:08 +0200
Message-ID: <7602103f-2c6e-3c1c-db03-a8c43a8fc32d@oth-regensburg.de>
In-Reply-To: <eadd41a9-8953-9f77-6e41-ce2301d4c3a3@redhat.com>

On 23/08/2021 12:33, David Hildenbrand wrote:
> On 23.08.21 12:16, Ralf Ramsauer wrote:
>>
>>
>> On 23/08/2021 10:02, David Hildenbrand wrote:
>>> On 20.08.21 15:13, Ralf Ramsauer wrote:
>>>> Dear mm folks,
>>>>
>>>> I have an issue where it would be great to have a COW-backed virtual
>>>> memory area within a userspace process. I know there's the possibility
>>>> to have a file-backed MAP_SHARED vma, which is later duplicated with
>>>> MAP_PRIVATE, but that's not exactly what I'm looking for.
>>>>
>>>> Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
>>>> PROT_RW. Userspace happily writes to/reads from it. At some point in
>>>> time, I want to 'snapshot' that single VMA within the context of the
>>>> process and without the need to fork(). Say there's something like
>>>>
>>>>     a = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>>>              MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>>>>     [... fill a ...]
>>>>
>>>>     b = mmdup(a, len, PROT_READ);
>>>>
>>>> b shall be the new base pointer of a new VMA that is backed by COW
>>>> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
>>>> VMAs (a and b) will fault on subsequent writes and duplicate the
>>>> previously shared physical mapping, pretty much what cow_fault or
>>>> shared_fault does.
>>>>
>>>> Afaict, this, or at least something like this, is currently not
>>>> supported by the kernel. Is that correct? If so, why? Generally
>>>> speaking, is it a bad idea?
>>>
>>> Not sure if it helps (most probably not), QEMU uses uffd-wp for
>>> background snapshots of VM memory. It's different, though, as you'll
>>> only have a single mapping and will be catching modifications to your
>>> single mapping, such that you can "save away" relevant snapshot pages
>>> before any modifications.
>>
>> Thanks for the pointer, David. I'll have a look.
>>
>>>
>>> You mention "both VMAs (a and b) will fault on subsequent writes", so
>>> would you actually be allowing PROT_WRITE access to b ("snapshot")?
>>>
>>
>> In general, yes, both should be allowed to be PROT_WRITE. So no matter
>> "which side" causes the fault, simply both will lead to duplication.
>>
>> If it would make things easier, then it would also be absolutely fine to
>> have the snapshot PROT_READ, which would satisfy my requirements as well.
> 
> I recall that Redis has very similar requirements for live snapshotting.

100 points, you just managed to figure out what we're exactly working
on! ;-)

> They used to handle it via fork(), just as you described, as I was told. I

Right, and fork() is damn slow, especially when forking large mappings.
A simple mmap() of the same area (w/o population) is at least 4x faster.
And you avoid all the other work that fork() implies but that you don't
actually need.
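
For reference, a fork()-based snapshot can be sketched roughly as
follows. This is only a minimal illustration, not taken from any real
code base: fork_snapshot_peek is a made-up name, and it assumes buf
lives in private memory (stack, heap or a MAP_PRIVATE mapping), so that
the child keeps a COW view of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: take a "snapshot" via fork(), modify the live
 * buffer in the parent, and return the byte the child still sees at
 * idx (0..255), or -1 on error.
 */
int fork_snapshot_peek(uint8_t *buf, size_t idx, uint8_t newval)
{
    assert(buf != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: reads its own COW view, i.e. the state at fork time,
         * and reports the byte through its exit status. */
        _exit(buf[idx]);
    }

    /* Parent: keeps writing to the live data. The kernel duplicates
     * the touched page, so the child's view is unaffected. */
    buf[idx] = newval;

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The whole address space is duplicated here, although only one buffer
needs to be snapshotted.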

> don't know if they already switched to uffd-wp, but I would guess they
> already did, because they were another excellent use case for uffd-wp
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html
> 
> You can handle COW manually in user space that way
> 
> 1. Creating a second anonymous mapping
> 2. Registering a UFFD-WP handler on the original mapping
> 3. WP-protecting the original mapping via UFFD
> 4. Tracking in a bitmap which pages were already copied

Ok, great, thanks, I'll have a look into that one!

> 
> So when you get notified about a WP event, you copy the page manually to
> the second mapping, un-protect the page, and remember in the bitmap that
> the page has been copied.
> 
> When reading the snapshot, you have to take a look at the bitmap to
> figure out if you have to read a specific page from the original, or
> from the second mapping. But you won't be able to just read the second
> mapping. (The question would be whether that is really required or can
> be worked around.)
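
To check my understanding of the bitmap scheme, here is a minimal
sketch of the bookkeeping only, without any uffd calls. All names
(struct snapshot, snapshot_save_page, snapshot_read, live_write) and
the fixed 4 KiB page size are assumptions for illustration; live_write
merely stands in for a write that would raise a WP event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u /* assumed page size for this sketch */

struct snapshot {
    uint8_t *orig;   /* original (live) mapping, still being written */
    uint8_t *copy;   /* second mapping that receives saved pages */
    uint8_t *copied; /* bitmap, one bit per page: 1 = saved in copy */
    size_t npages;
};

static bool page_copied(const struct snapshot *s, size_t pg)
{
    return (s->copied[pg / 8] >> (pg % 8)) & 1u;
}

/* What the WP handler would do before un-protecting page pg. */
void snapshot_save_page(struct snapshot *s, size_t pg)
{
    assert(pg < s->npages);
    if (page_copied(s, pg))
        return; /* already saved, keep the first (snapshot) version */
    memcpy(s->copy + pg * PAGE_SZ, s->orig + pg * PAGE_SZ, PAGE_SZ);
    s->copied[pg / 8] |= (uint8_t)(1u << (pg % 8));
}

/* Read one byte of the snapshot: the bitmap decides which mapping
 * holds the snapshot-time content of the page. */
uint8_t snapshot_read(const struct snapshot *s, size_t off)
{
    size_t pg = off / PAGE_SZ;

    assert(pg < s->npages);
    return page_copied(s, pg) ? s->copy[off] : s->orig[off];
}

/* Stand-in for a write to the live mapping that hits a WP page. */
void live_write(struct snapshot *s, size_t off, uint8_t val)
{
    snapshot_save_page(s, off / PAGE_SZ);
    s->orig[off] = val;
}
```

This ignores concurrency: with a real handler thread, the bitmap check
and the read in snapshot_read would have to be synchronised with the
copy in the handler.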

Thanks a bunch!
  Ralf