On Wed, Mar 9, 2022 at 1:22 AM Linus Torvalds wrote: > On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher wrote: > > > > Seems to be working on s390x for this test case at least; the kind of > > trace I'm getting is: > > Good. > > > This shows bursts of successful fault-ins in gfs2_file_read_iter > > (read_fault). The pauses in between might be caused by the storage; > > I'm not sure. > > Don't know about the pauses, but the burst size might be impacted by that > > +       const size_t max_size = 4 * PAGE_SIZE; > > thing that limits how many calls to fixup_user_fault() we do per > fault_in_safe_writeable(). > > So it might be worth checking if that value seems to make any difference. > > > I'd still let the caller of fault_in_safe_writeable() decide how much > > memory to fault in; the tight cap in fault_in_safe_writeable() doesn't > > seem useful. > > Well, there are real latency concerns there - fixup_user_fault() is > not necessarily all that low-cost. I just don't know if making the GUP based approach work instead of switching to fixup_user_fault(), or introducing something else, would make enough of a performance difference to be worth it. > And it's actually going to be worse when we have the sub-page coloring > issues happening, and will need to probe at a 128-byte granularity > (not on s390, but on arm64). > > At that point, we almost certainly will need to have a "probe user > space non-destructibly for writability" instruction (possibly > extending on our current arch_futex_atomic_op_inuser() > infrastructure). Let me add Catalin Marinas to the CC. From what I took from the previous discussion, probing at a sub-page granularity won't be necessary for bytewise copying: when the address we're trying to access is poisoned, fault_in_*() will fail; when we get a short result, that will take us to the poisoned address in the next iteration. The problematic case was copying objects that cross fault domains, when we're getting an all-or-nothing result back from the copying and the address we're trying to fault_in_*() isn't the address of the actual fault. The fix for those cases is to pass back the address of the actual fault in one way or another. > So honestly, if you do IO on areas that will get page faults on them, > to some degree it's a "you are doing something stupid, you get what > you deserve". This code should _work_, it should not have to worry > about users having bad patterns (where "read data into huge cold > mappings under enough memory pressure that it causes access bit faults > in the middle of the read" very much counts as such a bad pattern). With a large enough buffer, a simple malloc() will return unmapped pages, and reading into such a buffer will result in fault-in. So page faults during read() are actually pretty normal, and it's not the user's fault. In my test case, the buffer was pre-initialized with memset() to avoid those kinds of page faults, which meant that the page faults in gfs2_file_read_iter() only started to happen when we were out of memory. But that's not the common case. > > Also, you want to put in an extra L here: > > > Signed-off-by: linus Torvalds > > Heh. Fixed locally. Attached is a revised patch; only lightly tested so far. Changes: * Fix the function description. * No need for untagged_addr() as that is handled in fixup_user_fault(). * Get rid of max_size: it really makes no sense to second-guess what the caller needs. In cases where fault_in_safe_writeable() is used for buffers larger than a handful of pages, it is entirely the caller's responsibility to scale back the fault-in size in anticipation of or in reaction to page-out. * Use the same control flow as in fault_in_readable(); there is no need for anything more complicated anymore. * The same patch description still applies. Thanks, Andreas diff --git a/mm/gup.c b/mm/gup.c index f1bf3a1f6d109..5e777049bdf41 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1841,9 +1841,9 @@ EXPORT_SYMBOL(fault_in_writeable); * @uaddr: start of address range * @size: length of address range * - * Faults in an address range using get_user_pages, i.e., without triggering - * hardware page faults. This is primarily useful when we already know that - * some or all of the pages in the address range aren't in memory. + * Faults in an address range for writing. This is primarily useful when we + * already know that some or all of the pages in the address range aren't in + * memory. * * Other than fault_in_writeable(), this function is non-destructive. * @@ -1856,46 +1856,33 @@ EXPORT_SYMBOL(fault_in_writeable); */ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size) { - unsigned long start = (unsigned long)untagged_addr(uaddr); - unsigned long end, nstart, nend; + const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE; + unsigned long start = (unsigned long)uaddr, end; struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = NULL; - int locked = 0; - nstart = start & PAGE_MASK; + if (unlikely(size == 0)) + return 0; end = PAGE_ALIGN(start + size); - if (end < nstart) + if (end < start) end = 0; - for (; nstart != end; nstart = nend) { - unsigned long nr_pages; - long ret; - if (!locked) { - locked = 1; - mmap_read_lock(mm); - vma = find_vma(mm, nstart); - } else if (nstart >= vma->vm_end) - vma = vma->vm_next; - if (!vma || vma->vm_start >= end) - break; - nend = end ? min(end, vma->vm_end) : vma->vm_end; - if (vma->vm_flags & (VM_IO | VM_PFNMAP)) - continue; - if (nstart < vma->vm_start) - nstart = vma->vm_start; - nr_pages = (nend - nstart) / PAGE_SIZE; - ret = __get_user_pages_locked(mm, nstart, nr_pages, - NULL, NULL, &locked, - FOLL_TOUCH | FOLL_WRITE); - if (ret <= 0) - break; - nend = nstart + ret * PAGE_SIZE; + mmap_read_lock(mm); + if (!PAGE_ALIGNED(start)) { + if (fixup_user_fault(mm, start, fault_flags, NULL)) + return size; + start = PAGE_ALIGN(start); } - if (locked) - mmap_read_unlock(mm); - if (nstart == end) - return 0; - return size - min_t(size_t, nstart - start, size); + while (start != end) { + if (fixup_user_fault(mm, start, fault_flags, NULL)) + goto out; + start += PAGE_SIZE; + } + mmap_read_unlock(mm); + +out: + if (size > (unsigned long)uaddr - start) + return size - ((unsigned long)uaddr - start); + return 0; } EXPORT_SYMBOL(fault_in_safe_writeable);