linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [git pull] vfs.git get_user_pages_fast() conversion
@ 2017-11-17  3:02 Al Viro
  2017-11-17 20:50 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2017-11-17  3:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

	A bunch of places switched to get_user_pages_fast().

The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e:

  Linux 4.14-rc1 (2017-09-16 15:47:51 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git work.get_user_pages_fast

for you to fetch changes up to 77478715ba9242017976fd01de189e77fa072f51:

  ceph: use get_user_pages_fast() (2017-09-22 23:14:41 -0400)

----------------------------------------------------------------
Al Viro (8):
      vchiq_2835_arm: switch to get_user_pages_fast()
      rapidio: switch to get_user_pages_fast()
      fsl_hypervisor: switch to get_user_pages_fast()
      via_dmablit(): use get_user_pages_fast()
      st: use get_user_pages_fast()
      atomisp: use get_user_pages_fast()
      pvr2fs: use get_user_pages_fast()
      ceph: use get_user_pages_fast()

 drivers/gpu/drm/via/via_dmablit.c                    |  6 +++---
 drivers/rapidio/devices/rio_mport_cdev.c             |  6 ++----
 drivers/scsi/st.c                                    |  6 +-----
 .../staging/media/atomisp/pci/atomisp2/hmm/hmm_bo.c  |  6 ++----
 .../interface/vchiq_arm/vchiq_2835_arm.c             | 20 +++++++-------------
 drivers/video/fbdev/pvr2fb.c                         |  4 +---
 drivers/virt/fsl_hypervisor.c                        |  4 ++--
 net/ceph/pagevec.c                                   |  4 ++--
 8 files changed, 20 insertions(+), 36 deletions(-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [git pull] vfs.git get_user_pages_fast() conversion
  2017-11-17  3:02 [git pull] vfs.git get_user_pages_fast() conversion Al Viro
@ 2017-11-17 20:50 ` Linus Torvalds
  2017-11-17 21:32   ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2017-11-17 20:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel Mailing List, linux-fsdevel

On Thu, Nov 16, 2017 at 7:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>         A bunch of places switched to get_user_pages_fast().

I really would have liked a bit of a commentary.

Looking at the individual patches, I notice this, for example:

-               down_read(&current->mm->mmap_sem);
-               page_nr = get_user_pages((unsigned long)userptr,
-                                        (int)(bo->pgnr), 1, pages, NULL);
-               up_read(&current->mm->mmap_sem);
+               page_nr = get_user_pages_fast((unsigned long)userptr,
+                                        (int)(bo->pgnr), 1, pages);


from the atomisp conversion, and it made me throw up my hands in horror.

Not because the conversion was wrong, but because the original code is
so broken.

In particular, that "1" that is unchanged in the arguments is correct
in the conversion, but it was completely wrong in the original, even
if it happened to work.

it _should_ have been a FOLL_WRITE. Yes, it happens to have that
value, but it was broken.

(I note that a bit of grepping shows we have the same issue in a stale
comment in mm/ksm.c).

It would have been nice to see things like this mentioned in the commit message.

Because I'm pretty sure you actually _realized_ that as you made the
conversion, but there's no sign of that in the logs, because the
commit message just says

    atomisp: use get_user_pages_fast()

without mentioning how broken the old case was (even it if happened to work).

                   Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [git pull] vfs.git get_user_pages_fast() conversion
  2017-11-17 20:50 ` Linus Torvalds
@ 2017-11-17 21:32   ` Al Viro
  2017-11-18 21:45     ` Al Viro
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Al Viro @ 2017-11-17 21:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel

On Fri, Nov 17, 2017 at 12:50:47PM -0800, Linus Torvalds wrote:

> Not because the conversion was wrong, but because the original code is
> so broken.
> 
> In particular, that "1" that is unchanged in the arguments is correct
> in the conversion, but it was completely wrong in the original, even
> if it happened to work.
> 
> it _should_ have been a FOLL_WRITE. Yes, it happens to have that
> value, but it was broken.
> 
> (I note that a bit of grepping shows we have the same issue in a stale
> comment in mm/ksm.c).
> 
> It would have been nice to see things like this mentioned in the commit message.
> 
> Because I'm pretty sure you actually _realized_ that as you made the
> conversion, but there's no sign of that in the logs, because the
> commit message just says
> 
>     atomisp: use get_user_pages_fast()
> 
> without mentioning how broken the old case was (even it if happened to work).

Point...  Frankly, my impression from the whole thing is that get_user_pages()
and get_user_pages_unlocked() calling conventions are overcomplicated, especially
since most of the callers either should be get_user_pages_fast() or happen
to be inside implementations of get_user_pages_fast().  Grepping for the
remaining callers right now yields just this:

arch/cris/arch-v32/drivers/cryptocop.c:2722:    err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
arch/cris/arch-v32/drivers/cryptocop.c:2736:            err = get_user_pages((unsigned long int)oper.cipher_outdata,
arch/ia64/kernel/err_inject.c:145:      ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
arch/x86/mm/mpx.c:550:  gup_ret = get_user_pages((unsigned long)addr, nr_pages,
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:646:            r = get_user_pages(userptr, num_pages, flags, p, NULL);
drivers/gpu/drm/radeon/radeon_ttm.c:571:                r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
drivers/infiniband/core/umem.c:194:             ret = get_user_pages(cur_base,
drivers/infiniband/hw/mthca/mthca_memfree.c:475:        ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NUL
L);
drivers/infiniband/hw/qib/qib_user_pages.c:70:          ret = get_user_pages(start_page + got * PAGE_SIZE,
drivers/infiniband/hw/usnic/usnic_uiom.c:146:           ret = get_user_pages(cur_base,
drivers/media/pci/ivtv/ivtv-udma.c:127: err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
drivers/media/pci/ivtv/ivtv-yuv.c:78:   y_pages = get_user_pages_unlocked(y_dma.uaddr,
drivers/media/pci/ivtv/ivtv-yuv.c:82:           uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
drivers/media/v4l2-core/videobuf-dma-sg.c:188:  err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
drivers/misc/mic/scif/scif_rma.c:1399:          pinned_pages->nr_pages = get_user_pages(
drivers/misc/sgi-gru/grufault.c:201:    if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
mm/mempolicy.c:829:     err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL);
virt/kvm/kvm_main.c:1326:       return get_user_pages(start, 1, flags, page, NULL);
virt/kvm/kvm_main.c:1333:       rc = get_user_pages(addr, 1, flags, NULL, NULL);
virt/kvm/kvm_main.c:1395:               npages = get_user_pages_unlocked(addr, 1, page, flags);

and even those are dubious - e.g. cris ones could bloody well become a pair of
get_user_pages_fast(), without ->mmap_sem being held across both calls.  Itanic
one is almost certainly buggered - we are not holding ->mmap_sem there.

Hell knows...  I wonder if exposing FOLL_... thing outside of mm/gup.c actually
makes sense.  With the users so heavily skewed towards just two combinations of
flags (0 and FOLL_WRITE)...

And for get_user_pages() itself it's even more ridiculous - vmalist (the last
argument) is non-NULL in only one caller.  Which uses it only to check if all
of the VMAs happen to be hugetlb ones, apparently.

FWIW, I wanted to trim the users of those two suckers and see what remains.
And then go through those with maintainers of subsystems in question, to
see what is really wanted there.  That's for the coming cycle, though...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [git pull] vfs.git get_user_pages_fast() conversion
  2017-11-17 21:32   ` Al Viro
@ 2017-11-18 21:45     ` Al Viro
  2017-11-21 18:35       ` Andrea Arcangeli
  2017-11-18 21:49     ` Dan Williams
  2017-11-19 21:23     ` mthca misuse of get_user_pages() (was Re: [git pull] vfs.git get_user_pages_fast() conversion) Al Viro
  2 siblings, 1 reply; 7+ messages in thread
From: Al Viro @ 2017-11-18 21:45 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux Kernel Mailing List, linux-fsdevel, Linus Torvalds

On Fri, Nov 17, 2017 at 09:32:15PM +0000, Al Viro wrote:

> And for get_user_pages() itself it's even more ridiculous - vmalist (the last
> argument) is non-NULL in only one caller.  Which uses it only to check if all
> of the VMAs happen to be hugetlb ones, apparently.
> 
> FWIW, I wanted to trim the users of those two suckers and see what remains.
> And then go through those with maintainers of subsystems in question, to
> see what is really wanted there.  That's for the coming cycle, though...

	Incidentally, why the hell do we need that notify_drop argument of
__get_user_pages_locked()?  Its only use is (and has always been)
        if (notify_drop && lock_dropped && *locked) {
                /*
                 * We must let the caller know we temporarily dropped the lock
                 * and so the critical section protected by it was lost.
                 */
                up_read(&mm->mmap_sem);
                *locked = 0;
        }
in the very end of __get_user_pages_locked().  There are 4 callers:
get_user_pages_locked() and get_user_pages_remote() pass true,
get_user_pages() and __get_user_pages_unlocked() - false.

get_user_pages() passes NULL for 'locked', so we _never_ get
to the body of that if() anyway - lock_dropped is only set if
locked != NULL.

And in __get_user_pages_unlocked() we have
        ret = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, NULL,
                                      &locked, false, gup_flags | FOLL_TOUCH);
        if (locked)
                up_read(&mm->mmap_sem);

Suppose we passed true instead of false here.  If that if (notify_drop...)
still has not triggered, nothing has changed.  If it *has* triggered, we had
*locked (i.e. locked from the __get_user_pages_unlocked() stack frame)
non-zero, so we'd just have dropped ->mmap_sem just before returning to
__get_user_pages_unlocked() instead of doing that just after.  And set
*locked to zero, so that __get_user_pages_unlocked() won't end up dropping
it.

	Looks like we can bloody well get rid of that argument and do just
        if (lock_dropped && *locked) {
                /*
                 * We must let the caller know we temporarily dropped the lock
                 * and so the critical section protected by it was lost.
                 */
                up_read(&mm->mmap_sem);
                *locked = 0;
        }
in __get_user_pages_locked(), or am I missing something subtle there?  Andrea?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [git pull] vfs.git get_user_pages_fast() conversion
  2017-11-17 21:32   ` Al Viro
  2017-11-18 21:45     ` Al Viro
@ 2017-11-18 21:49     ` Dan Williams
  2017-11-19 21:23     ` mthca misuse of get_user_pages() (was Re: [git pull] vfs.git get_user_pages_fast() conversion) Al Viro
  2 siblings, 0 replies; 7+ messages in thread
From: Dan Williams @ 2017-11-18 21:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Fri, Nov 17, 2017 at 1:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 17, 2017 at 12:50:47PM -0800, Linus Torvalds wrote:
>
>> Not because the conversion was wrong, but because the original code is
>> so broken.
>>
>> In particular, that "1" that is unchanged in the arguments is correct
>> in the conversion, but it was completely wrong in the original, even
>> if it happened to work.
>>
>> it _should_ have been a FOLL_WRITE. Yes, it happens to have that
>> value, but it was broken.
>>
>> (I note that a bit of grepping shows we have the same issue in a stale
>> comment in mm/ksm.c).
>>
>> It would have been nice to see things like this mentioned in the commit message.
>>
>> Because I'm pretty sure you actually _realized_ that as you made the
>> conversion, but there's no sign of that in the logs, because the
>> commit message just says
>>
>>     atomisp: use get_user_pages_fast()
>>
>> without mentioning how broken the old case was (even it if happened to work).
>
> Point...  Frankly, my impression from the whole thing is that get_user_pages()
> and get_user_pages_unlocked() calling conventions are overcomplicated, especially
> since most of the callers either should be get_user_pages_fast() or happen
> to be inside implementations of get_user_pages_fast().  Grepping for the
> remaining callers right now yields just this:
>
> arch/cris/arch-v32/drivers/cryptocop.c:2722:    err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
> arch/cris/arch-v32/drivers/cryptocop.c:2736:            err = get_user_pages((unsigned long int)oper.cipher_outdata,
> arch/ia64/kernel/err_inject.c:145:      ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
> arch/x86/mm/mpx.c:550:  gup_ret = get_user_pages((unsigned long)addr, nr_pages,
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:646:            r = get_user_pages(userptr, num_pages, flags, p, NULL);
> drivers/gpu/drm/radeon/radeon_ttm.c:571:                r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
> drivers/infiniband/core/umem.c:194:             ret = get_user_pages(cur_base,
> drivers/infiniband/hw/mthca/mthca_memfree.c:475:        ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NUL
> L);
> drivers/infiniband/hw/qib/qib_user_pages.c:70:          ret = get_user_pages(start_page + got * PAGE_SIZE,
> drivers/infiniband/hw/usnic/usnic_uiom.c:146:           ret = get_user_pages(cur_base,
> drivers/media/pci/ivtv/ivtv-udma.c:127: err = get_user_pages_unlocked(user_dma.uaddr, user_dma.page_count,
> drivers/media/pci/ivtv/ivtv-yuv.c:78:   y_pages = get_user_pages_unlocked(y_dma.uaddr,
> drivers/media/pci/ivtv/ivtv-yuv.c:82:           uv_pages = get_user_pages_unlocked(uv_dma.uaddr,
> drivers/media/v4l2-core/videobuf-dma-sg.c:188:  err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
> drivers/misc/mic/scif/scif_rma.c:1399:          pinned_pages->nr_pages = get_user_pages(
> drivers/misc/sgi-gru/grufault.c:201:    if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
> mm/mempolicy.c:829:     err = get_user_pages(addr & PAGE_MASK, 1, 0, &p, NULL);
> virt/kvm/kvm_main.c:1326:       return get_user_pages(start, 1, flags, page, NULL);
> virt/kvm/kvm_main.c:1333:       rc = get_user_pages(addr, 1, flags, NULL, NULL);
> virt/kvm/kvm_main.c:1395:               npages = get_user_pages_unlocked(addr, 1, page, flags);
>
> and even those are dubious - e.g. cris ones could bloody well become a pair of
> get_user_pages_fast(), without ->mmap_sem being held across both calls.  Itanic
> one is almost certainly buggered - we are not holding ->mmap_sem there.
>
> Hell knows...  I wonder if exposing FOLL_... thing outside of mm/gup.c actually
> makes sense.  With the users so heavily skewed towards just two combinations of
> flags (0 and FOLL_WRITE)...
>
> And for get_user_pages() itself it's even more ridiculous - vmalist (the last
> argument) is non-NULL in only one caller.  Which uses it only to check if all
> of the VMAs happen to be hugetlb ones, apparently.

One note here, we've discovered that filesystem-dax mappings are
broken with respect to DMA and fallocate(PUNCH_HOLE). As a band-aid
we're looking to change rdma, video/media, and any other subsystem
that tries to pin pages indefinitely to fail in the vma_is_dax() case
with a new get_user_pages_longterm() helper [1]. Then the plan is to
follow on with new infrastructure to register memory with a file
lease. We need an interface for the kernel to notify userspace that
the pages it pinned need to be unpinned because the file blocks are
being deallocated (where 'file blocks' and 'memory pages' are one in
the same in the dax case).

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013295.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* mthca misuse of get_user_pages() (was Re: [git pull] vfs.git get_user_pages_fast() conversion)
  2017-11-17 21:32   ` Al Viro
  2017-11-18 21:45     ` Al Viro
  2017-11-18 21:49     ` Dan Williams
@ 2017-11-19 21:23     ` Al Viro
  2 siblings, 0 replies; 7+ messages in thread
From: Al Viro @ 2017-11-19 21:23 UTC (permalink / raw)
  To: linux-rdma
  Cc: Linux Kernel Mailing List, linux-fsdevel, Linus Torvalds, Doug Ledford

On Fri, Nov 17, 2017 at 09:32:15PM +0000, Al Viro wrote:

[snip]
> drivers/infiniband/hw/mthca/mthca_memfree.c:475:        ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NULL);
[snip]

> Itanic
> one is almost certainly buggered - we are not holding ->mmap_sem there.

So's the mthca one - that caller (mthca_map_user_db()) does not take ->mmap_sem
and at least some of its calls are immediately preceded by copy_from_user().
If we don't have ->mmap_sem there, this get_user_pages() is oopsably racy; if
we do, those copy_from_user() calls is deadlock-prone.

To make things even nastier, mthca_map_user_db() does grab a mutex around the
chunk that contains get_user_pages() call:
        mutex_lock(&db_tab->mutex);
several lines prior.  And _that_ is taken on a whole lot of codepaths; if any of
those happen to be under ->mmap_sem, we can't grab ->mmap_sem just around that
get_user_pages().  I've poked around a bit, but I'm really unfamiliar with that
code...  The questions so far:
	* can mthca_map_user_db() ever be called under ->mmap_sem?  Looks like
it shouldn't, but...
	* do we really need that get_user_pages() under ->mutex?  Would something
like
	....
        if (db_tab->page[i].refcount) {
                ++db_tab->page[i].refcount;
                goto out;
        }
        mutex_unlock(&db_tab->mutex);
	ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, true, pages);
        if (ret < 0)
                return ret;
        mutex_lock(&db_tab->mutex);
        if (unlikely(db_tab->page[i].refcount)) {
		/* lost the race */
		put_page(pages[0]);
                ++db_tab->page[i].refcount;
                goto out;
	}
	....
in place of the current
        if (db_tab->page[i].refcount) {
                ++db_tab->page[i].refcount;
                goto out;
        }

        ret = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages, NULL);
        if (ret < 0)
                goto out;
be a usable solution?

Comments?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [git pull] vfs.git get_user_pages_fast() conversion
  2017-11-18 21:45     ` Al Viro
@ 2017-11-21 18:35       ` Andrea Arcangeli
  0 siblings, 0 replies; 7+ messages in thread
From: Andrea Arcangeli @ 2017-11-21 18:35 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel Mailing List, linux-fsdevel, Linus Torvalds

On Sat, Nov 18, 2017 at 09:45:31PM +0000, Al Viro wrote:
> in __get_user_pages_locked(), or am I missing something subtle there?  Andrea?

You're not missing anything as far as the logic is concerned.

However see the __always_inline, I added "notify_drop" purely to
optimize away such branch at build time so that gcc doesn't need to
include in every different copy it does of it. "notify_drop" is known
at build time and constant true/false, the other variables are not.

Either you drop the __always_inline or "notify_drop" makes sense to me
to keep to reduce the size of the inline and run faster at runtime.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-11-21 18:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-17  3:02 [git pull] vfs.git get_user_pages_fast() conversion Al Viro
2017-11-17 20:50 ` Linus Torvalds
2017-11-17 21:32   ` Al Viro
2017-11-18 21:45     ` Al Viro
2017-11-21 18:35       ` Andrea Arcangeli
2017-11-18 21:49     ` Dan Williams
2017-11-19 21:23     ` mthca misuse of get_user_pages() (was Re: [git pull] vfs.git get_user_pages_fast() conversion) Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).