I wanted to chime in here following an individual exchange with Andrea,
because I have been using the userfaultfd remap functionality downstream
for a research project at the University of Colorado. I've included links
below to 4.3 and 4.10-rc6 kernels patched to enable userfaultfd remap.
However, I hit a kernel BUG with 4.3 through 4.9, and with 4.10-rc6 the
application fails with invalid page messages. I'm hoping the cause might
be more obvious to someone on this list.

These errors occur when concurrent threads are reading and writing to the
userfaultfd region while a separate process is performing UFFDIO_REMAP
operations on the same region. Our use case requires this ability to
remove memory from the region. The error is not observed if only a single
thread is reading and writing to the userfaultfd region.

There are 2 attachments:
- 4.10_dmesg.txt (from the 4.10 kernel, where the application hangs after
  these messages)
- 4.3-vmcore-dmesg.txt (from the 4.3 kernel BUG; I also have the vmcore
  from this crash)

The 4.10-rc6 kernel with patches:
https://github.com/blakecaldwell/userfault-kernel/tree/userfault_4.10_rc6

The 4.3 kernel with patches:
https://github.com/blakecaldwell/userfault-kernel/commits/4.3_userfault

Note that my patch for 4.3 here:
https://github.com/blakecaldwell/userfault-kernel/commit/8bbcbed8d61dcb8533af67bb00f41a0df66e0535
is no longer part of the above 4.10 kernel; it has been replaced by:
https://github.com/blakecaldwell/userfault-kernel/commit/15a77c6fe494f4b1757d30cd137fe66ab06a38c3

I'm hopeful for 3 things out of this:
1. to add that remap functionality within userfaultfd is critical for our
   use case, and we hope that it can make it into mainline in the future.
2. to get more eyes on the patches, which might provide some insight into
   why we see failures with concurrent operations on a userfault-registered
   region.
3. that the code above with patches will be useful to others interested in
   using the remap functionality.

Thanks,
Blake

> -----Original Message-----
> From:
> Sent: None
> Subject:
>
> CC'ed linux-mm with your ACK as this may be of general interest, plus CC'ed
> others that expressed interest in UFFDIO_REMAP use cases.
>
> On Sun, Feb 19, 2017 at 04:35:54PM +0000, krigovski, louis wrote:
> > Hi,
> > I am looking at your slides from LinuxCon Toronto 2016.
> >
> > You mention functionality
> >
> > 1. "Removing the memory atomically... after adding it with UFFDIO_COPY"
> >
> > Is this possible? I don’t see how you can unmap page and give copy of it
> > to the caller.
>
> Originally removing the memory atomically was the only way and there was
> no UFFDIO_COPY.
>
> The non linear relocation had some constraints (the source page had to be
> not-shared so rmap re-linearization was possible).
>
> The main complexity in UFFDIO_REMAP is about the re-linearization of rmap
> for the pages moved post remap; copying atomically doesn't require any rmap
> change, so it's simpler.
>
> As long as the page is not shared, solving the rmap is possible as the page
> will not become non-linear post-UFFDIO_REMAP, and I solved that already for
> anon pages in the old userfault19 branch (last branch where I included
> UFFDIO_REMAP, until it can be re-introduced later).
>
> The last UFFDIO_REMAP implementation is below. It's only worthwhile to
> remove memory; postcopy doesn't require it, but it would benefit distributed
> shared memory implementations or similar usages requiring full memory
> externalization. Others already asked for it.
>
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault19
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?h=userfault19&id=7a84c6b2af19bd2f989be849b4b8d1096e44d5ea
>
> The primary reason why UFFDIO_REMAP was deferred is that UFFDIO_COPY is
> not only simpler but it's faster too, for the postcopy live migration case
> (we verified it with benchmarks just in case).
>
> The reason remap is slower is because of the IPIs that need to be delivered
> to all CPUs that mapped the address space of the source virtual range to
> flush/invalidate the TLB.
>
> I think IPI deferral and batching would be possible to skip IPIs for every
> single page UFFDIO_REMAPped (using a virtual range ring whose TLB flush is
> only done at ring-overflow), but it's tricky and it'd have more complex
> semantics than mremap. The above implementation in the link retains the same
> strict semantics as mremap() but it's slower than UFFDIO_COPY as a result.
> When UFFDIO_REMAP is used to remove memory from the final destination,
> however, the IPI cannot be deferred, so if only used to remove memory the
> current implementation would be already optimal.
>
> About the WP support: it kind of works, but I've (non-kernel-crashing)
> bug reports pending for KVM get_user_pages access that we need to solve
> before it's fully workable for things like postcopy live snapshotting too.
> So it's not finished. We focused on completing the hugetlbfs, shmem and
> non-cooperative features in time for the 4.11 merge window, and so now we
> can concentrate on finishing the WP support.
>
> I've more patches pending than what's currently in the aa.git userfault main
> branch: the main objective of the pending work is to have a user (non hw
> interpreted) flag on pagetables and swap entries that can differentiate when
> a page is wrprotected by other means or through UFFDIO_WRITEPROTECT. Just
> like the soft dirty pte/swapentry flag.
> So that there will be no risk of false positive WP faults post fork() or
> anything that wrprotects the pagetables by other means. Then even soft dirty
> users can be converted to use userfaultfd WP support that has a
> computational complexity lower than O(N), and just like the PML hw VT
> feature, won't require scanning all pagetables to find which pages have been
> re-dirtied.
>
> The WP feature isn't just good for distributed shared memory combined with
> UFFDIO_REMAP to remove memory, but it'll be useful for postcopy live
> snapshotting and for regular databases that may be using fork() instead.
> fork() is not ideal because databases run into trouble with THP WP faults
> that turn out to be less efficient than PAGE_SIZEd WP faults for that
> specific snapshotting use case. Furthermore, spawning a userfault thread
> will be more efficient than forking off a new process and there will be no
> TLB thrashing during the snapshotting. With user page faults it's always
> userland that decides the granularity of the fault resolution, and THP
> in-kernel will cope with whatever granularity the userfault handler thread
> decides. In the snapshotting case the smallest page size the kernel supports
> is always more efficient and creates less memory footprint too. Last but not
> least, userfaultfd WP will allow the snapshotting to decide whether to
> throttle on I/O if too much memory is getting allocated despite using the
> smallest page size granularity available (fork() instead doesn't allow I/O
> throttling, so no matter if THP is on or off, the max memory usage can reach
> twice the size of the db cache, which may trigger OOM in containers or
> similar).
>
> Thanks,
> Andrea
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
> majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
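
For readers less familiar with the fork()-based snapshotting that the quoted
message compares userfaultfd WP against, here is a minimal sketch in C. It
does not use userfaultfd, UFFDIO_REMAP or UFFDIO_WRITEPROTECT at all; it only
illustrates the copy-on-write behaviour of plain fork(). The function name
fork_snapshot_is_stable and the constant SNAP_LEN are made up for this
illustration, and the buffer merely stands in for a "db cache".

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical size of the buffer standing in for a "db cache". */
#define SNAP_LEN ((size_t)4 * 4096)

/*
 * Take a fork()-based snapshot of a private anonymous buffer, overwrite
 * the buffer in the parent, and let the child check its own view.
 * Returns 1 if the child still saw the pre-fork contents, 0 if it saw
 * the parent's later writes, -1 on error.
 */
int fork_snapshot_is_stable(void)
{
    int go[2], res[2], ret;
    char token = 'x', ok = 0;
    char *cache;
    pid_t pid;

    cache = mmap(NULL, SNAP_LEN, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cache == MAP_FAILED)
        return -1;
    memset(cache, 'A', SNAP_LEN); /* the state to be snapshotted */

    if (pipe(go) != 0 || pipe(res) != 0)
        return -1;

    pid = fork(); /* the snapshot is taken here */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: wait until the parent has dirtied the buffer. */
        char seen_old = 1;
        size_t i;

        close(go[1]);
        close(res[0]);
        if (read(go[0], &token, 1) != 1)
            _exit(2);
        for (i = 0; i < SNAP_LEN; i++)
            if (cache[i] != 'A')
                seen_old = 0;
        if (write(res[1], &seen_old, 1) != 1)
            _exit(2);
        _exit(0);
    }

    /* Parent: these writes hit pages that fork() left write-protected,
     * so the kernel copies them; afterwards both copies are resident. */
    close(go[0]);
    close(res[1]);
    memset(cache, 'B', SNAP_LEN);
    assert(cache[0] == 'B'); /* the parent sees its own writes */

    if (write(go[1], &token, 1) != 1 || read(res[0], &ok, 1) != 1)
        ret = -1;
    else
        ret = ok;

    close(go[1]);
    close(res[0]);
    waitpid(pid, NULL, 0);
    munmap(cache, SNAP_LEN);
    return ret;
}
```

Calling fork_snapshot_is_stable() from a main() should return 1 on a typical
Linux system. Once the parent has rewritten the whole buffer, the old and new
contents both occupy memory, which is the "twice the size of the db cache"
worst case mentioned in the quoted text; nothing in this sketch can throttle
that growth.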