On Wed, Mar 9, 2022 at 1:22 AM Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 3:25 PM Andreas Gruenbacher wrote:
> >
> > Seems to be working on s390x for this test case at least; the kind of
> > trace I'm getting is:
>
> Good.
>
> > This shows bursts of successful fault-ins in gfs2_file_read_iter
> > (read_fault). The pauses in between might be caused by the storage;
> > I'm not sure.
>
> Don't know about the pauses, but the burst size might be impacted by that
>
> +       const size_t max_size = 4 * PAGE_SIZE;
>
> thing that limits how many calls to fixup_user_fault() we do per
> fault_in_safe_writeable().
>
> So it might be worth checking if that value seems to make any difference.
>
> > I'd still let the caller of fault_in_safe_writeable() decide how much
> > memory to fault in; the tight cap in fault_in_safe_writeable() doesn't
> > seem useful.
>
> Well, there are real latency concerns there - fixup_user_fault() is
> not necessarily all that low-cost.

I just don't know if making the GUP-based approach work instead of
switching to fixup_user_fault(), or introducing something else, would
make enough of a performance difference to be worth it.

> And it's actually going to be worse when we have the sub-page coloring
> issues happening, and will need to probe at a 128-byte granularity
> (not on s390, but on arm64).
>
> At that point, we almost certainly will need to have a "probe user
> space non-destructively for writability" instruction (possibly
> extending on our current arch_futex_atomic_op_inuser()
> infrastructure).

Let me add Catalin Marinas to the CC.

From what I took from the previous discussion, probing at sub-page
granularity won't be necessary for bytewise copying: when the address
we're trying to access is poisoned, fault_in_*() will fail; when we get
a short result, that will take us to the poisoned address in the next
iteration.
The problematic case was copying objects that cross fault domains, when
we're getting an all-or-nothing result back from the copying and the
address we're trying to fault_in_*() isn't the address of the actual
fault. The fix for those cases is to pass back the address of the
actual fault in one way or another.

> So honestly, if you do IO on areas that will get page faults on them,
> to some degree it's a "you are doing something stupid, you get what
> you deserve". This code should _work_, it should not have to worry
> about users having bad patterns (where "read data into huge cold
> mappings under enough memory pressure that it causes access bit faults
> in the middle of the read" very much counts as such a bad pattern).

With a large enough buffer, a simple malloc() will return unmapped
pages, and reading into such a buffer will result in fault-in. So page
faults during read() are actually pretty normal, and it's not the
user's fault. In my test case, the buffer was pre-initialized with
memset() to avoid those kinds of page faults, which meant that the page
faults in gfs2_file_read_iter() only started to happen when we were out
of memory. But that's not the common case.

> Also, you want to put in an extra L here:
>
> > Signed-off-by: linus Torvalds

Heh. Fixed locally.

Attached is a revised patch; only lightly tested so far. Changes:

* Fix the function description.

* No need for untagged_addr() as that is handled in fixup_user_fault().

* Get rid of max_size: it really makes no sense to second-guess what the
  caller needs. In cases where fault_in_safe_writeable() is used for
  buffers larger than a handful of pages, it is entirely the caller's
  responsibility to scale back the fault-in size in anticipation of or
  in reaction to page-out.

* Use the same control flow as in fault_in_readable(); there is no need
  for anything more complicated anymore.

* The same patch description still applies.
Thanks,
Andreas

diff --git a/mm/gup.c b/mm/gup.c
index f1bf3a1f6d109..5e777049bdf41 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1841,9 +1841,9 @@ EXPORT_SYMBOL(fault_in_writeable);
 * @uaddr: start of address range
 * @size: length of address range
 *
- * Faults in an address range using get_user_pages, i.e., without triggering
- * hardware page faults. This is primarily useful when we already know that
- * some or all of the pages in the address range aren't in memory.
+ * Faults in an address range for writing. This is primarily useful when we
+ * already know that some or all of the pages in the address range aren't in
+ * memory.
 *
 * Other than fault_in_writeable(), this function is non-destructive.
 *
@@ -1856,46 +1856,33 @@ EXPORT_SYMBOL(fault_in_writeable);
 */
 size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
 {
-	unsigned long start = (unsigned long)untagged_addr(uaddr);
-	unsigned long end, nstart, nend;
+	const unsigned int fault_flags = FAULT_FLAG_WRITE | FAULT_FLAG_KILLABLE;
+	unsigned long start = (unsigned long)uaddr, end;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int locked = 0;
 
-	nstart = start & PAGE_MASK;
+	if (unlikely(size == 0))
+		return 0;
 	end = PAGE_ALIGN(start + size);
-	if (end < nstart)
+	if (end < start)
 		end = 0;
-	for (; nstart != end; nstart = nend) {
-		unsigned long nr_pages;
-		long ret;
 
-		if (!locked) {
-			locked = 1;
-			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
-		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
-			break;
-		nend = end ? min(end, vma->vm_end) : vma->vm_end;
-		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-			continue;
-		if (nstart < vma->vm_start)
-			nstart = vma->vm_start;
-		nr_pages = (nend - nstart) / PAGE_SIZE;
-		ret = __get_user_pages_locked(mm, nstart, nr_pages,
-					      NULL, NULL, &locked,
-					      FOLL_TOUCH | FOLL_WRITE);
-		if (ret <= 0)
-			break;
-		nend = nstart + ret * PAGE_SIZE;
+	mmap_read_lock(mm);
+	if (!PAGE_ALIGNED(start)) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			goto out;
+		start = PAGE_ALIGN(start);
 	}
-	if (locked)
-		mmap_read_unlock(mm);
-	if (nstart == end)
-		return 0;
-	return size - min_t(size_t, nstart - start, size);
+	while (start != end) {
+		if (fixup_user_fault(mm, start, fault_flags, NULL))
+			break;
+		start += PAGE_SIZE;
+	}
+out:
+	mmap_read_unlock(mm);
+
+	if (size > start - (unsigned long)uaddr)
+		return size - (start - (unsigned long)uaddr);
+	return 0;
 }
 EXPORT_SYMBOL(fault_in_safe_writeable);