From: Hugh Dickins <email@example.com>
To: Roland Dreier <firstname.lastname@example.org>
Andrea Arcangeli <email@example.com>,
Subject: Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
Date: Thu, 26 Jan 2012 12:01:53 -0800 (PST) [thread overview]
Message-ID: <alpine.LSU.firstname.lastname@example.org> (raw)
On Wed, 25 Jan 2012, Roland Dreier wrote:
> From: Roland Dreier <email@example.com>
> Right now, we always pass write==1 to get_user_pages(), even when we
> only intend to read the memory. We pass force==1 if we're going for
> read-only and force==0 if we want writable. The reasoning behind this
> seems to be contained in this out-of-tree changelog from 2005:
> Always ask get_user_pages() for writable pages, but pass force=1
> if the consumer has only asked for read-only pages. This fixes a
> problem registering memory that has just been allocated but not
> touched yet, while allowing registration of read-only memory to
> continue to work.
> However, I don't think the mm works like this today, and indeed GUP
> will fault in pages for an untouched read-only mapping just fine with
> write and force set to 0. In fact, always passing 1 for write causes
> problems with modern kernels, because we end up hitting the "early
> C-O-W break" case in __do_fault(), even for read-only mappings where
> this makes no sense.
> Signed-off-by: Roland Dreier <firstname.lastname@example.org>
I'd love to say go ahead with this, it looks so obviously correct...
... so obviously correct that it made me wonder why on earth the old
code was writing for read, and how to make make sense of the paragraph
you very helpfully quote above.
I think this is largely about the ZERO_PAGE. If you do a read fault
on an untouched anonymous area, it maps in the ZERO_PAGE, and will
only give you your own private zeroed page when there's a write fault
to touch it.
I think your ib_umem_get() is making sure to give the process its own
private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace
will not be wanting to write into it, but presumably it expects to see
data placed in that page by the underlying driver, and it would be very
bad news if the driver wrote its data into the ZERO_PAGE.
We did go through a phase (2.6.24 through 2.6.31) of not using the
ZERO_PAGE, but that presented its own problems, so it was reinstated.
So, ugly as it looks, I think you have to stick with write=1 force=!write.
Does that make sense now?
And although the ZERO_PAGE gives the most vivid example, I think it goes
beyond that to the wider issue of pages COWed after fork() - GUPping in
this way fixes the easy cases (and I've no desire again to go down that
rathole of fixing the most general cases).
> This patch comes from me trying to do userspace RDMA on a memory
> region exported from a character driver and mapped with
> mmap(... PROT_READ, MAP_PRIVATE ...)
> The character driver has a trivial mmap method that just sets vm_ops
> and and equally trivial fault method that essentially just does
> vmf->page = vmalloc_to_page(buf + (vmf->pgoff << PAGE_SHIFT));
> ie the most elementary way to export a vmalloc'ed buffer to userspace.
> However, when I tried doing
> ibv_reg_mr(... IBV_ACCESS_REMOTE_READ ...)
> in userspace on that mmap region, I found that COW was happening and
> so neither userspace nor the registered memory ended up pointing at
> the kernel buffer anymore, exactly because of the COW in __do_fault()
> I mention in the changelog above.
> The patch below fixes my test case, and doesn't seem to break any of
> the ibverbs examples and other simple tests of userspace verbs that I
> tried. But that's far from an exhaustive test suite.
> I'd definitely appreciate comments from MM experts here, since I'm not
> positive of my understand of G-U-P and friends, and I don't want to
> apps because this is wrong in some special case I didn't try.
> Also testing from anyone with an RDMA app that does anything at all
> fancy with memory allocation or registration would be helpful.
> PS Let me know if I didn't go on long enough about this one-line patch
> and I can write some more.
> drivers/infiniband/core/umem.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 71f0c0f..fb5abd3 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> ret = get_user_pages(current, current->mm, cur_base,
> min_t(unsigned long, npages,
> PAGE_SIZE / sizeof (struct page *)),
> - 1, !umem->writable, page_list, vma_list);
> + umem->writable, 0, page_list, vma_list);
> if (ret < 0)
> goto out;
next prev parent reply other threads:[~2012-01-26 20:02 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-01-26 5:59 [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters Roland Dreier
2012-01-26 20:01 ` Hugh Dickins [this message]
2012-01-26 22:45 ` Roland Dreier
2012-01-27 17:28 ` Roland Dreier
2012-01-28 2:31 ` Hugh Dickins
2012-01-28 19:25 ` Jason Gunthorpe
2012-01-30 19:19 ` Roland Dreier
2012-01-28 2:19 ` Hugh Dickins
2012-01-30 19:16 ` Roland Dreier
2012-01-30 20:20 ` Andrea Arcangeli
2012-02-06 17:46 ` Roland Dreier
2012-01-30 20:34 ` Hugh Dickins
2012-02-06 17:39 ` Roland Dreier
2012-02-07 20:39 ` Hugh Dickins
2012-02-08 23:10 ` Hugh Dickins
2012-02-09 17:50 ` Roland Dreier
2012-02-09 22:57 ` Hugh Dickins
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).