linux-mm.kvack.org archive mirror
* COW in userspace
@ 2021-08-20 13:13 Ralf Ramsauer
  2021-08-20 23:12 ` Jerome Glisse
  2021-08-23  8:02 ` David Hildenbrand
  0 siblings, 2 replies; 6+ messages in thread
From: Ralf Ramsauer @ 2021-08-20 13:13 UTC
  To: linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

Dear mm folks,

I have an issue where it would be great to have a COW-backed virtual
memory area within a userspace process. I know there's the possibility
to have a file-backed MAP_SHARED vma, which is later duplicated with
MAP_PRIVATE, but that's not exactly what I'm looking for.

Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
PROT_RW. Userspace happily writes to/reads from it. At some point in
time, I want to 'snapshot' that single VMA within the context of the
process and without the need to fork(). Say there's something like

  a = mmap(0, len, PROT_RW, MAP_ANON | MAP_POPULATE, -1, 0);
  [... fill a ...]

  b = mmdup(a, len, PROT_READ);

b shall be the new base pointer of a new VMA that is backed by COW
mechanisms. After mmdup, those regular COW mechanisms do the rest: both
VMAs (a and b) will fault on subsequent writes and duplicate the
previously shared physical mapping, pretty much what cow_fault or
shared_fault does.

AFAICT, this, or at least something like it, is currently not supported
by the kernel. Is that correct? If so, why? Generally speaking, is it a
bad idea?

I dug a bit into the mm code, and I think all the stuff that would be
required is already there, so I wonder what I'm missing.


This is some related work I found on that topic:

https://sfb876.tu-dortmund.de/PublicPublicationFiles/kotthaus_2016a.pdf

They implement mmapcopy(), which pretty much would fulfill my
requirements. However, I still wonder why the kernel doesn't support
something like that by default, so maybe some mm expert could shed light
on this.

Any suggestions welcome!

Thanks
  Ralf



* Re: COW in userspace
  2021-08-20 13:13 COW in userspace Ralf Ramsauer
@ 2021-08-20 23:12 ` Jerome Glisse
  2021-08-23  8:02 ` David Hildenbrand
  1 sibling, 0 replies; 6+ messages in thread
From: Jerome Glisse @ 2021-08-20 23:12 UTC
  To: Ralf Ramsauer; +Cc: linux-mm, Wolfgang Mauerer, Mario Mintel

On Fri, Aug 20, 2021 at 03:13:11PM +0200, Ralf Ramsauer wrote:
> Dear mm folks,
> 
> I have an issue where it would be great to have a COW-backed virtual
> memory area within a userspace process. I know there's the possibility
> to have a file-backed MAP_SHARED vma, which is later duplicated with
> MAP_PRIVATE, but that's not exactly what I'm looking for.
> 
> Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
> PROT_RW. Userspace happily writes to/reads from it. At some point in
> time, I want to 'snapshot' that single VMA within the context of the
> process and without the need to fork(). Say there's something like
> 
>   a = mmap(0, len, PROT_RW, MAP_ANON | MAP_POPULATE, -1, 0);
>   [... fill a ...]
> 
>   b = mmdup(a, len, PROT_READ);
> 
> b shall be the new base pointer of a new VMA that is backed by COW
> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
> VMAs (a and b) will fault on subsequent writes and duplicate the
> previously shared physical mapping, pretty much what cow_fault or
> shared_fault does.
> 
> AFAICT, this, or at least something like it, is currently not supported
> by the kernel. Is that correct? If so, why? Generally speaking, is it a
> bad idea?

Not supported. I guess there never was an enticing use case, i.e. a known
application that we care about and that would benefit from such a feature.
Proving the benefit means you have to write the kernel patch and update
the application to get some benchmarks.

I also think that this would be too much like MAP_COPY, which is a bad
idea for file-backed vmas (see [1]). So even if we were to restrict it to
anonymous memory, it might make people feel uneasy, as one could fear that
some crazy folks would try to extend it to file-backed vmas.

Note that what is in [1] does not apply to anonymous memory, as anonymous
memory with anon_vma already has per-page version tracking (ignoring
the can of worms that COW is and all the issues we are finding with it).


[1] https://yarchive.net/comp/linux/map_copy.html


> 
> I dug a bit into the mm code, and I think all the stuff that would be
> required is already there, so I wonder what I'm missing.
> 
> 
> This is some related work I found on that topic:
> 
> https://sfb876.tu-dortmund.de/PublicPublicationFiles/kotthaus_2016a.pdf
> 
> They implement mmapcopy(), which pretty much would fulfill my
> requirements. However, I still wonder why the kernel doesn't support
> something like that by default, so maybe some mm expert could shed light
> on this.
> 

From a quick look at the results, it is not impressive: it only improves
the situation if you compare it to KSM. Against the original code, the
difference seems to be within the margin of error from a performance
point of view.

Cheers,
Jérôme Glisse



* Re: COW in userspace
  2021-08-20 13:13 COW in userspace Ralf Ramsauer
  2021-08-20 23:12 ` Jerome Glisse
@ 2021-08-23  8:02 ` David Hildenbrand
  2021-08-23 10:16   ` [EXT] " Ralf Ramsauer
  1 sibling, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2021-08-23  8:02 UTC
  To: Ralf Ramsauer, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 20.08.21 15:13, Ralf Ramsauer wrote:
> Dear mm folks,
> 
> I have an issue where it would be great to have a COW-backed virtual
> memory area within a userspace process. I know there's the possibility
> to have a file-backed MAP_SHARED vma, which is later duplicated with
> MAP_PRIVATE, but that's not exactly what I'm looking for.
> 
> Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
> PROT_RW. Userspace happily writes to/reads from it. At some point in
> time, I want to 'snapshot' that single VMA within the context of the
> process and without the need to fork(). Say there's something like
> 
>    a = mmap(0, len, PROT_RW, MAP_ANON | MAP_POPULATE, -1, 0);
>    [... fill a ...]
> 
>    b = mmdup(a, len, PROT_READ);
> 
> b shall be the new base pointer of a new VMA that is backed by COW
> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
> VMAs (a and b) will fault on subsequent writes and duplicate the
> previously shared physical mapping, pretty much what cow_fault or
> shared_fault does.
> 
> AFAICT, this, or at least something like it, is currently not supported
> by the kernel. Is that correct? If so, why? Generally speaking, is it a
> bad idea?

Not sure if it helps (most probably not), but QEMU uses uffd-wp for
background snapshots of VM memory. It's different, though, as you'll
only have a single mapping and will be catching modifications to that
single mapping, such that you can "save away" relevant snapshot pages
before any modifications.

You mention "both VMAs (a and b) will fault on subsequent writes", so 
would you actually be allowing PROT_WRITE access to b ("snapshot")?

-- 
Thanks,

David / dhildenb




* Re: [EXT] Re: COW in userspace
  2021-08-23  8:02 ` David Hildenbrand
@ 2021-08-23 10:16   ` Ralf Ramsauer
  2021-08-23 10:33     ` David Hildenbrand
  0 siblings, 1 reply; 6+ messages in thread
From: Ralf Ramsauer @ 2021-08-23 10:16 UTC
  To: David Hildenbrand, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel



On 23/08/2021 10:02, David Hildenbrand wrote:
> Not sure if it helps (most probably not), but QEMU uses uffd-wp for
> background snapshots of VM memory. It's different, though, as you'll
> only have a single mapping and will be catching modifications to that
> single mapping, such that you can "save away" relevant snapshot pages
> before any modifications.

Thanks for the pointer, David. I'll have a look.

> 
> You mention "both VMAs (a and b) will fault on subsequent writes", so
> would you actually be allowing PROT_WRITE access to b ("snapshot")?
> 

In general, yes, both should be allowed to be PROT_WRITE. No matter
which side causes the fault, the write will lead to duplication.

If it makes things easier, it would also be absolutely fine to have the
snapshot PROT_READ only, which would satisfy my requirements as well.

Thanks
  Ralf



* Re: [EXT] Re: COW in userspace
  2021-08-23 10:16   ` [EXT] " Ralf Ramsauer
@ 2021-08-23 10:33     ` David Hildenbrand
  2021-08-23 10:49       ` Ralf Ramsauer
  0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2021-08-23 10:33 UTC
  To: Ralf Ramsauer, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 23.08.21 12:16, Ralf Ramsauer wrote:
> In general, yes, both should be allowed to be PROT_WRITE. No matter
> which side causes the fault, the write will lead to duplication.
>
> If it makes things easier, it would also be absolutely fine to have the
> snapshot PROT_READ only, which would satisfy my requirements as well.

I recall that Redis has very similar requirements for live snapshotting.
They used to handle it via fork(), just as you described, as I was told. I
don't know if they already switched to uffd-wp, but I would guess they
already did, because they were another excellent use case for uffd-wp:

https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html

You can handle COW manually in user space that way:

1. Creating a second anonymous mapping
2. Registering a UFFD-WP handler on the original mapping
3. WP-protecting the original mapping via UFFD
4. Tracking in a bitmap which pages were already copied

So when you get notified about a WP event, you copy the page manually to 
the second mapping, un-protect the page, and remember in the bitmap that 
the page has been copied.

When reading the snapshot, you have to take a look at the bitmap to
figure out if you have to read a specific page from the original, or
from the second mapping. But you won't be able to just read the second
mapping. (The question is whether that is really required or can be
worked around.)

-- 
Thanks,

David / dhildenb




* Re: [EXT] Re: COW in userspace
  2021-08-23 10:33     ` David Hildenbrand
@ 2021-08-23 10:49       ` Ralf Ramsauer
  0 siblings, 0 replies; 6+ messages in thread
From: Ralf Ramsauer @ 2021-08-23 10:49 UTC
  To: David Hildenbrand, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel



On 23/08/2021 12:33, David Hildenbrand wrote:
> I recall that Redis has very similar requirements for live snapshotting.

100 points, you just managed to figure out what we're exactly working
on! ;-)

> They used to handle it via fork() just as you described as I was told. I

Right, and fork() is damn slow, especially when forking large mappings.
A simple mmap() of the same area (w/o population) is at least 4x faster.
And you avoid all the work that fork() implies but that you don't
actually need.

> don't know if they already switched to uffd-wp, but I would guess they
> already did, because they were another excellent use case for uffd-wp
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html
> 
> You can handle COW manually in user space that way
> 
> 1. Creating a second anonymous mapping
> 2. Registering a UFFD-WP handler on the original mapping
> 3. WP-protecting the original mapping via UFFD
> 4. Tracking in a bitmap which pages were already copied

Ok, great, thanks, I'll have a look into that one!

> 
> So when you get notified about a WP event, you copy the page manually to
> the second mapping, un-protect the page, and remember in the bitmap that
> the page has been copied.
> 
> When reading the snapshot, you have to take a look at the bitmap to
> figure out if you have to read a specific page from the original, or
> from the second mapping. But you won't be able to just read the second
> mapping. (The question is whether that is really required or can be
> worked around.)

Thanks a bunch!
  Ralf

