From: Andrea Arcangeli <aarcange@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Borislav Petkov <bp@suse.de>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
kvm@vger.kernel.org
Subject: Re: [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults
Date: Sat, 22 Jun 2019 13:51:42 -0400 [thread overview]
Message-ID: <20190622175142.GA32455@redhat.com> (raw)
In-Reply-To: <20190522141803.c6714f96f57612caaac5d19b@linux-foundation.org>
Hello everyone,
On Wed, May 22, 2019 at 02:18:03PM -0700, Andrew Morton wrote:
> > arch/x86/kernel/fpu/signal.c:198:8-31: -> gup with !pages
This simply had not to return -EFAULT if ret < nr_pages.. but ret >= 0.
Instead it did:
if (ret == nr_pages)
goto retry;
return -EFAULT;
That was the bug and the correct code would have been:
ret = get_user_pages_unlocked(pages=NULL)
if (ret < 0)
return -EFAULT;
goto retry;
This eventually should have worked fine but it was less efficient
because it's still acting in a full prefault mode and it just tells
GUP that pages = NULL and so all it is trying to do is to issue the
blocking I/O after the mmap_sem has been released already.
Overall the solution applied in commit
b81ff1013eb8eef2934ca7e8cf53d553c1029e84 looks nicer.
Alternatively it could have used down_read(); get_user_pages(); which
prevents get_user_pages to drop the mmap_sem and break the loop if
some blocking I/O had to be executed outside mmap_sem. But that would
have the side effect of breaking userfaultfd (uffd requires
gup_locked/unlocked and FAULT_FLAG_ALLOW_RETRY to be set in the fault
flags).
Eventually we need to allow VM_FAULT_RETRY to be returned even if
FOLL_TRIED is set, so in theory get_user_pages_unlocked(pages=NULL) in
a loop must eventually stop returning VM_FAULT_RETRY. FOLL_TRIED could
still disambiguate if VM_FAULT_RETRY should or should not be returned
so that it is only returned only if it cannnot be avoided
(i.e. userfaultfd case).
With gup_unlocked(pages=NULL) however all we are interested about is
to execute the blocking I/O and we don't care to map anything in the
pagetables. A later page fault has to happen anyway for sure because
pages was == NULL, it just needs to be a fast one.
> > arch/x86/mm/mpx.c:423:11-25: -> gup with !pages
Note that get_user_pages is never affected by whatever change after
the below, !locked check in gup_locked:
if (!locked)
/* VM_FAULT_RETRY couldn't trigger, bypass */
return ret;
The bypass means when locked is NULL, there is a 1:1 bypass from
__get_user_pages<->get_user_pages and the VM_FAULT_RETRY dance never
runs.
get_user_pages in fact can't support userfaultfd, which makes ptrace
and core dump and the hwpoison non blocking in VM_FAULT_RETRY.
All places that must support userfaultfd must use
get_user_pages_unlocked/locked or somehow end up with
FAULT_FLAG_ALLOW_RETRY set in the fault flags.
> > virt/kvm/async_pf.c:90:1-22: -> gup with !pages
Didn't this get slowed down with the commit
df17277b2a85c00f5710e33ce238ba4114687a28?
I mean it was a feature not a bug to skip that additional
__get_user_pages(FOLL_TRIED).
> > virt/kvm/kvm_main.c:1437:6-20: -> gup with !pages
Like for mpx.c get_user_pages is agnostic to all these gup_locked
changes because it sets locked = NULL, it couldn't break the loop
early because it couldn't return VM_FAULT_RETRY.
>
> OK.
Commit df17277b2a85c00f5710e33ce238ba4114687a28 is now applied.
So I think the effect it has is to make async_pf.c slower and we
didn't solve anything.
There are two __get_user_pages:
1) ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
vmas, locked);
if (called by get_user_pages)
return ret; /* bypass the whole VM_FAULT_RETRY logic */
*locked = 1;
lock_dropped = true;
down_read(&mm->mmap_sem);
2) ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
pages, NULL, NULL);
The problem introduced is that 2) is getting executed with pages==NULL
but there's no point to ever run 2) with pages = NULL.
async_pf especially uses nr_pages == 1, so it couldn't get any more
optimal than it already was.
Before df17277b2a85c00f5710e33ce238ba4114687a28 we broke the loop as
soon as the first __get_user_pages returned VM_FAULT_RETRY.
We can argue if we shouldn't have broken the loop and we should have
kept executing only the first __get_user_pages (marked "1)" above) for
the whole range, but nr_pages == 1 is common and in such case there's
no difference between the two behaviors.
The prefetch callers with nr_pages == 1, didn't even check the retval
at all:
down_read(&mm->mmap_sem);
get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
&locked); ^^^^ pages NULL
// retval ignored
It should probably check for retval < 0... but the fault will be
retried for good later still with get_user_pages_unlocked() but with
pages != NULL, so it'll find out later if it's a segfault.
Now if we change the code to skip the second __get_user_pages it's not
clear if we can return nr_pages because we may still not have faulted
in the whole range in the pagetables. I guess we could still return
nr_pages even if we scanned the whole range with only the first of the
two __get_user_pages. However if you had mmu notifier registered in
the range nr_pages would have stronger semantics if you would execute
2) too, but then without pages array not-NULL such stronger semantics
cannot be taken advantage of anyway, because you don't know where
those pages are and you can't map them in a secondary MMU even if you
execute the line 2).
I personally preferred the older code which should at least in theory
run faster, it just required documentation that if "pages == NULL"
we'll break the loop early because it has to be a "prefetch" attempt
and it must be retried until nr_pages == ret.
Either that or add a "continue" to skip the second __get_user_pages in
line 2) above and then returning nr_pages to indicate VM_FAULT_RETRY
may very well have been returned on all page offsets in the virtual
range. That will behave the same for async_pf.c because nr_pages == 1.
When VM_FAULT_RETRY is returned all I/O should be complete (no matter
if network or disk with userfaultfd or just pagecache readpage on
filebacked kernel faults) and only a minor fault is required to obtain
the page. But it is better to defer that second minor fault to the
point where "pages" already become != NULL, so we end up calling
__get_user_pages 2 times instead of 3 times.
Thanks,
Andrea
next prev parent reply other threads:[~2019-06-22 17:51 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-26 17:33 [PATCH] x86/fpu: Use fault_in_pages_writeable() for pre-faulting Sebastian Andrzej Siewior
2019-05-26 17:35 ` Sebastian Andrzej Siewior
2019-05-26 19:25 ` Hugh Dickins
2019-05-28 11:54 ` My emacs problem -- was " Pavel Machek
2019-05-29 4:18 ` Andrew Morton
2019-05-29 7:25 ` [PATCH v2] " Sebastian Andrzej Siewior
2019-05-14 14:29 ` [PATCH] mm/gup: continue VM_FAULT_RETRY processing event for pre-faults Mike Rapoport
2019-05-16 16:25 ` Andrei Vagin
2019-05-21 15:53 ` Mike Rapoport
2019-05-22 19:21 ` Andrew Morton
2019-05-22 19:43 ` Sebastian Andrzej Siewior
2019-05-24 22:22 ` Hugh Dickins
2019-05-25 8:45 ` Sebastian Andrzej Siewior
2019-05-25 18:09 ` Hugh Dickins
2019-05-26 19:36 ` Sebastian Andrzej Siewior
2019-05-26 20:17 ` Hugh Dickins
2019-05-26 17:25 ` Pavel Machek
2019-05-22 20:38 ` Mike Rapoport
2019-05-22 21:18 ` Andrew Morton
2019-06-22 17:51 ` Andrea Arcangeli [this message]
2019-06-06 17:25 ` [tip:x86/urgent] x86/fpu: Use fault_in_pages_writeable() for pre-faulting tip-bot for Hugh Dickins
2019-05-29 21:29 ` [PATCH v2] " Chris Wilson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190622175142.GA32455@redhat.com \
--to=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=bigeasy@linutronix.de \
--cc=bp@suse.de \
--cc=dgilbert@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=rppt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).