* COW in userspace
From: Ralf Ramsauer @ 2021-08-20 13:13 UTC
To: linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

Dear mm folks,

I have an issue where it would be great to have a COW-backed virtual
memory area within a userspace process. I know it is possible to have a
file-backed MAP_SHARED vma that is later duplicated with MAP_PRIVATE,
but that's not exactly what I'm looking for.

Say I have an anonymous, page-aligned VMA a, with MAP_PRIVATE and
PROT_READ | PROT_WRITE. Userspace happily writes to and reads from it.
At some point in time, I want to 'snapshot' that single VMA within the
context of the process and without the need to fork(). Say there's
something like

  a = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  [... fill a ...]

  b = mmdup(a, len, PROT_READ);

b shall be the new base pointer of a new VMA that is backed by COW
mechanisms. After mmdup, those regular COW mechanisms do the rest: both
VMAs (a and b) will fault on subsequent writes and duplicate the
previously shared physical mapping, pretty much what do_cow_fault() or
do_shared_fault() does.

Afaict, this, or at least something like this, is currently not
supported by the kernel. Is that correct? If so, why? Generally
speaking, is it a bad idea?

I dug a bit into the mm code, and I think all the stuff that would be
required is already there, so I wonder what I'm missing.

This is some related work I found on that topic:

https://sfb876.tu-dortmund.de/PublicPublicationFiles/kotthaus_2016a.pdf

They implement mmapcopy(), which would pretty much fulfill my
requirements. However, I still wonder why the kernel doesn't support
something like that by default, so maybe some mm expert could shed light
on this.

Any suggestions welcome!

Thanks
  Ralf
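For illustration, the file-backed alternative mentioned in the message (a MAP_SHARED mapping later duplicated with MAP_PRIVATE) can be sketched as follows. This is a minimal sketch under stated assumptions, not anything proposed in the thread: a temporary file from tmpfile() stands in for the backing file, and the helper cow_file_backed_demo() and its result struct are hypothetical names. It is meant to show why this is not a real snapshot: writes through the private mapping b never reach a, but on Linux writes to the shared mapping a typically remain visible through b until b has copied the affected page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical result record: what each mapping observes at each step. */
struct cow_demo_result {
    char b_after_a_write;        /* b[0] after a[0] was changed (b untouched so far) */
    char a_after_b_write;        /* a[0] after b[0] was changed                      */
    char b_after_b_write;        /* b[0] after b[0] was changed                      */
    char b_after_second_a_write; /* b[0] after a[0] changed again, post-COW          */
};

/* Returns 0 on success, -1 on any error. */
int cow_file_backed_demo(struct cow_demo_result *res)
{
    const size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *a = MAP_FAILED, *b = MAP_FAILED;
    int rc = -1;

    FILE *f = tmpfile();              /* assumption: any regular file would do */
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, (off_t)len) < 0)
        goto out;

    /* "a": shared, writable view of the file. */
    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED)
        goto out;
    memset(a, 'A', len);

    /* "b": private (copy-on-write) view of the same file range. */
    b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (b == MAP_FAILED)
        goto out;

    a[0] = 'X';                        /* modify the original after the "snapshot" */
    res->b_after_a_write = b[0];       /* typically 'X' on Linux: not a snapshot   */

    b[0] = 'Y';                        /* first write to b copies the page         */
    res->a_after_b_write = a[0];       /* private write is not visible through a   */
    res->b_after_b_write = b[0];

    a[0] = 'Z';                        /* b now owns a private copy of this page   */
    res->b_after_second_a_write = b[0];
    rc = 0;
out:
    if (b != MAP_FAILED)
        munmap(b, len);
    if (a != MAP_FAILED)
        munmap(a, len);
    fclose(f);                         /* also closes fd and removes the temp file */
    return rc;
}
```

Calling cow_file_backed_demo() and inspecting the result shows that only pages already written through b are isolated from later writes to a, which is the gap the hypothetical mmdup() above is meant to close.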
* Re: COW in userspace
From: Jerome Glisse @ 2021-08-20 23:12 UTC
To: Ralf Ramsauer; +Cc: linux-mm, Wolfgang Mauerer, Mario Mintel

On Fri, Aug 20, 2021 at 03:13:11PM +0200, Ralf Ramsauer wrote:
> [...]
> Afaict, this, or at least something like this is currently not supported
> by the kernel. Is that correct? If so, why? Generally speaking, is it a
> bad idea?

Not supported. I guess there never was an enticing use case, i.e. a
known application that we care about and that would benefit from such a
feature. Proving that means you have to do the kernel patch and update
the app to get some benchmarks.

I also think that this would be too much like MAP_COPY, which is a bad
idea for file-backed vmas (see [1]). So even if we were to restrict it
to anonymous memory, it might make people feel uneasy, as one could fear
that some crazy folks would try to extend it to file-backed vmas.

Note that what is in [1] does not apply to anonymous memory, as
anonymous memory with anon_vma already has per-page versioning tracking
(ignoring the can of worms that COW is and all the issues we are finding
with it).

[1] https://yarchive.net/comp/linux/map_copy.html

> I dug a bit into the mm code, and I think all the stuff that would be
> required is already there, so I wonder what I'm missing.
>
> This is some related work I found on that topic:
>
> https://sfb876.tu-dortmund.de/PublicPublicationFiles/kotthaus_2016a.pdf
>
> They implement mmapcopy(), which would pretty much fulfill my
> requirements. However, I still wonder why the kernel doesn't support
> something like that by default, so maybe some mm expert could shed light
> on this.

Looking quickly at the results, they are not impressive: it only
improves the situation if you compare it to KSM. Against the original
code it seems to be within the margin of error from a performance point
of view.

Cheers,
Jérôme Glisse
* Re: COW in userspace
From: David Hildenbrand @ 2021-08-23 8:02 UTC
To: Ralf Ramsauer, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 20.08.21 15:13, Ralf Ramsauer wrote:
> [...]
> b shall be the new base pointer of a new VMA that is backed by COW
> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
> VMAs (a and b) will fault on subsequent writes and duplicate the
> previously shared physical mapping, pretty much what do_cow_fault() or
> do_shared_fault() does.
>
> Afaict, this, or at least something like this is currently not supported
> by the kernel. Is that correct? If so, why? Generally speaking, is it a
> bad idea?

Not sure if it helps (most probably not), but QEMU uses uffd-wp for
background snapshots of VM memory. It's different, though, as you'll
only have a single mapping and will be catching modifications to that
single mapping, such that you can save away the relevant snapshot pages
before any modification.

You mention "both VMAs (a and b) will fault on subsequent writes", so
would you actually be allowing PROT_WRITE access to b (the "snapshot")?

-- 
Thanks,

David / dhildenb
* Re: [EXT] Re: COW in userspace
From: Ralf Ramsauer @ 2021-08-23 10:16 UTC
To: David Hildenbrand, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 23/08/2021 10:02, David Hildenbrand wrote:
> On 20.08.21 15:13, Ralf Ramsauer wrote:
>> [...]
>
> Not sure if it helps (most probably not), QEMU uses uffd-wp for
> background snapshots of VM memory. It's different, though, as you'll
> only have a single mapping and will be catching modifications to your
> single mapping, such that you can save away relevant snapshot pages
> before any modifications.

Thanks for the pointer, David. I'll have a look.

> You mention "both VMAs (a and b) will fault on subsequent writes", so
> would you actually be allowing PROT_WRITE access to b ("snapshot")?

In general, yes, both should be allowed to be PROT_WRITE. So no matter
which side causes the fault, a write from either one will lead to
duplication.

If it would make things easier, then it would also be absolutely fine to
have the snapshot PROT_READ, which would satisfy my requirements as
well.

Thanks
  Ralf
* Re: [EXT] Re: COW in userspace
From: David Hildenbrand @ 2021-08-23 10:33 UTC
To: Ralf Ramsauer, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 23.08.21 12:16, Ralf Ramsauer wrote:
> On 23/08/2021 10:02, David Hildenbrand wrote:
>> [...]
>> You mention "both VMAs (a and b) will fault on subsequent writes", so
>> would you actually be allowing PROT_WRITE access to b ("snapshot")?
>
> In general, yes, both should be allowed to be PROT_WRITE. So no matter
> "which side" causes the fault, simply both will lead to duplication.
>
> If it would make things easier, then it would also be absolutely fine to
> have the snapshot PROT_READ, which would suffice my requirements as well.

I recall that Redis has very similar requirements for live snapshotting.
As I was told, they used to handle it via fork(). I don't know if they
have already switched to uffd-wp, but I would guess they have, because
they were another excellent use case for uffd-wp:

https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html

You can handle COW manually in user space that way, by

1. Creating a second anonymous mapping
2. Registering a UFFD-WP handler on the original mapping
3. WP-protecting the original mapping via UFFD
4. Tracking in a bitmap which pages were already copied

So when you get notified about a WP event, you copy the page manually to
the second mapping, un-protect the page, and remember in the bitmap that
the page has been copied.

When reading the snapshot, you have to look at the bitmap to figure out
whether to read a specific page from the original or from the second
mapping. But you won't be able to just read the second mapping. (The
question would be whether that is really required or can be worked
around.)

-- 
Thanks,

David / dhildenb
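The bookkeeping side of the scheme above (steps 1 and 4, the copy on a WP event, and the bitmap lookup when reading) can be sketched in plain C. This is a minimal sketch under stated assumptions, not QEMU's or Redis's code: the userfaultfd registration and event loop (steps 2 and 3) are deliberately left out, so snapshot_on_wp_event() has to be called by hand where a real handler would receive the write-protect notification. All names (struct snapshot, SNAP_PAGE_SIZE, the snapshot_* functions) are hypothetical, the page size is fixed at 4096 bytes for illustration, and the "second mapping" is an ordinary heap allocation.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SNAP_PAGE_SIZE 4096u  /* assumed page size for this sketch */
#define SNAP_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

struct snapshot {
    unsigned char *orig;    /* original mapping, still modified by the application */
    unsigned char *copy;    /* second mapping: pre-write contents of copied pages   */
    unsigned long *bitmap;  /* bit i set => page i has been copied to `copy`        */
    size_t npages;
};

static int snapshot_page_copied(const struct snapshot *s, size_t page)
{
    return (int)((s->bitmap[page / SNAP_WORD_BITS] >> (page % SNAP_WORD_BITS)) & 1UL);
}

/* Step 1 and the empty bitmap of step 4. Returns 0 on success, -1 on error. */
int snapshot_begin(struct snapshot *s, unsigned char *orig, size_t npages)
{
    s->orig = orig;
    s->npages = npages;
    s->copy = malloc(npages * SNAP_PAGE_SIZE);
    s->bitmap = calloc((npages + SNAP_WORD_BITS - 1) / SNAP_WORD_BITS,
                       sizeof(unsigned long));
    if (s->copy == NULL || s->bitmap == NULL) {
        free(s->copy);
        free(s->bitmap);
        return -1;
    }
    /* A real implementation would now register `orig` with userfaultfd in
     * write-protect mode and write-protect the whole range (steps 2 and 3). */
    return 0;
}

/* What the handler does for a WP event on `page`, before the write proceeds. */
void snapshot_on_wp_event(struct snapshot *s, size_t page)
{
    if (!snapshot_page_copied(s, page)) {
        memcpy(s->copy + page * SNAP_PAGE_SIZE,
               s->orig + page * SNAP_PAGE_SIZE, SNAP_PAGE_SIZE);
        s->bitmap[page / SNAP_WORD_BITS] |= 1UL << (page % SNAP_WORD_BITS);
    }
    /* A real handler would un-protect the page here so the write can continue. */
}

/* Read side: pick the source of a snapshot page according to the bitmap. */
const unsigned char *snapshot_page(const struct snapshot *s, size_t page)
{
    const unsigned char *base = snapshot_page_copied(s, page) ? s->copy : s->orig;
    return base + page * SNAP_PAGE_SIZE;
}

void snapshot_end(struct snapshot *s)
{
    free(s->copy);
    free(s->bitmap);
}
```

A reader of the snapshot goes through snapshot_page() for every page, which is exactly the limitation noted above: the second mapping alone is never a complete image. A concurrent implementation would also need to order the bitmap update against readers, which this single-threaded sketch ignores.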
* Re: [EXT] Re: COW in userspace
From: Ralf Ramsauer @ 2021-08-23 10:49 UTC
To: David Hildenbrand, linux-mm; +Cc: Wolfgang Mauerer, Mario Mintel

On 23/08/2021 12:33, David Hildenbrand wrote:
> On 23.08.21 12:16, Ralf Ramsauer wrote:
>> [...]
>
> I recall that Redis has very similar requirements for live snapshotting.

100 points, you just managed to figure out what we're exactly working
on! ;-)

> They used to handle it via fork() just as you described as I was told. I

Right, and fork() is damn slow, especially when forking large mappings.
A simple mmap() of the same area (w/o population) is at least 4x faster.
And you don't have to do all the other work that fork() implies but that
you don't actually need.

> don't know if they already switched to uffd-wp, but I would guess they
> already did, because they were another excellent use case for uffd-wp
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html
>
> You can handle COW manually in user space that way
>
> 1. Creating a second anonymous mapping
> 2. Registering a UFFD-WP handler on the original mapping
> 3. WP-protecting the original mapping via UFFD
> 4. Tracking in a bitmap which pages were already copied

Ok, great, thanks, I'll have a look into that one!

> So when you get notified about a WP event, you copy the page manually to
> the second mapping, un-protect the page, and remember in the bitmap that
> the page has been copied.
>
> When reading the snapshot, you have to take a look at the bitmap to
> figure out if you have to read a specific page from the original, or
> from the second mapping. But you won't be able to just read the second
> mapping. (question would be, if that is really required or can be
> worked-around)

Thanks a bunch!
  Ralf