linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
@ 2012-01-26  5:59 Roland Dreier
  2012-01-26 20:01 ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2012-01-26  5:59 UTC (permalink / raw)
  To: linux-rdma; +Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, linux-kernel

From: Roland Dreier <roland@purestorage.com>

Right now, we always pass write==1 to get_user_pages(), even when we
only intend to read the memory.  We pass force==1 if we're going for
read-only and force==0 if we want writable.  The reasoning behind this
seems to be contained in this out-of-tree changelog from 2005:

    Always ask get_user_pages() for writable pages, but pass force=1
    if the consumer has only asked for read-only pages.  This fixes a
    problem registering memory that has just been allocated but not
    touched yet, while allowing registration of read-only memory to
    continue to work.

However, I don't think the mm works like this today, and indeed GUP
will fault in pages for an untouched read-only mapping just fine with
write and force set to 0.  In fact, always passing 1 for write causes
problems with modern kernels, because we end up hitting the "early
C-O-W break" case in __do_fault(), even for read-only mappings where
this makes no sense.

Signed-off-by: Roland Dreier <roland@purestorage.com>
---
This patch comes from me trying to do userspace RDMA on a memory
region exported from a character driver and mapped with

    mmap(... PROT_READ, MAP_PRIVATE ...)

The character driver has a trivial mmap method that just sets vm_ops
and and equally trivial fault method that essentially just does

    vmf->page = vmalloc_to_page(buf + (vmf->pgoff << PAGE_SHIFT));

ie the most elementary way to export a vmalloc'ed buffer to userspace.

However, when I tried doing

    ibv_reg_mr(... IBV_ACCESS_REMOTE_READ ...)

in userspace on that mmap region, I found that COW was happening and
so neither userspace nor the registered memory ended up pointing at
the kernel buffer anymore, exactly because of the COW in __do_fault()
I mention in the changelog above.

The patch below fixes my test case, and doesn't seem to break any of
the ibverbs examples and other simple tests of userspace verbs that I
tried.  But that's far from an exhaustive test suite.

I'd definitely appreciate comments from MM experts here, since I'm not
positive of my understand of G-U-P and friends, and I don't want to
apps because this is wrong in some special case I didn't try.

Also testing from anyone with an RDMA app that does anything at all
fancy with memory allocation or registration would be helpful.

Thanks!

PS Let me know if I didn't go on long enough about this one-line patch
and I can write some more.

 drivers/infiniband/core/umem.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 71f0c0f..fb5abd3 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     umem->writable, 0, page_list, vma_list);
 
 		if (ret < 0)
 			goto out;
-- 
1.7.8.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-26  5:59 [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters Roland Dreier
@ 2012-01-26 20:01 ` Hugh Dickins
  2012-01-26 22:45   ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-01-26 20:01 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Wed, 25 Jan 2012, Roland Dreier wrote:
> From: Roland Dreier <roland@purestorage.com>
> 
> Right now, we always pass write==1 to get_user_pages(), even when we
> only intend to read the memory.  We pass force==1 if we're going for
> read-only and force==0 if we want writable.  The reasoning behind this
> seems to be contained in this out-of-tree changelog from 2005:
> 
>     Always ask get_user_pages() for writable pages, but pass force=1
>     if the consumer has only asked for read-only pages.  This fixes a
>     problem registering memory that has just been allocated but not
>     touched yet, while allowing registration of read-only memory to
>     continue to work.
> 
> However, I don't think the mm works like this today, and indeed GUP
> will fault in pages for an untouched read-only mapping just fine with
> write and force set to 0.  In fact, always passing 1 for write causes
> problems with modern kernels, because we end up hitting the "early
> C-O-W break" case in __do_fault(), even for read-only mappings where
> this makes no sense.
> 
> Signed-off-by: Roland Dreier <roland@purestorage.com>

I'd love to say go ahead with this, it looks so obviously correct...
... so obviously correct that it made me wonder why on earth the old
code was writing for read, and how to make make sense of the paragraph
you very helpfully quote above.

I think this is largely about the ZERO_PAGE.  If you do a read fault
on an untouched anonymous area, it maps in the ZERO_PAGE, and will
only give you your own private zeroed page when there's a write fault
to touch it.

I think your ib_umem_get() is making sure to give the process its own
private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace
will not be wanting to write into it, but presumably it expects to see
data placed in that page by the underlying driver, and it would be very
bad news if the driver wrote its data into the ZERO_PAGE.

We did go through a phase (2.6.24 through 2.6.31) of not using the
ZERO_PAGE, but that presented its own problems, so it was reinstated.

So, ugly as it looks, I think you have to stick with write=1 force=!write.
Does that make sense now?

And although the ZERO_PAGE gives the most vivid example, I think it goes
beyond that to the wider issue of pages COWed after fork() - GUPping in
this way fixes the easy cases (and I've no desire again to go down that
rathole of fixing the most general cases).

Hugh

> ---
> This patch comes from me trying to do userspace RDMA on a memory
> region exported from a character driver and mapped with
> 
>     mmap(... PROT_READ, MAP_PRIVATE ...)
> 
> The character driver has a trivial mmap method that just sets vm_ops
> and and equally trivial fault method that essentially just does
> 
>     vmf->page = vmalloc_to_page(buf + (vmf->pgoff << PAGE_SHIFT));
> 
> ie the most elementary way to export a vmalloc'ed buffer to userspace.
> 
> However, when I tried doing
> 
>     ibv_reg_mr(... IBV_ACCESS_REMOTE_READ ...)
> 
> in userspace on that mmap region, I found that COW was happening and
> so neither userspace nor the registered memory ended up pointing at
> the kernel buffer anymore, exactly because of the COW in __do_fault()
> I mention in the changelog above.
> 
> The patch below fixes my test case, and doesn't seem to break any of
> the ibverbs examples and other simple tests of userspace verbs that I
> tried.  But that's far from an exhaustive test suite.
> 
> I'd definitely appreciate comments from MM experts here, since I'm not
> positive of my understand of G-U-P and friends, and I don't want to
> apps because this is wrong in some special case I didn't try.
> 
> Also testing from anyone with an RDMA app that does anything at all
> fancy with memory allocation or registration would be helpful.
> 
> Thanks!
> 
> PS Let me know if I didn't go on long enough about this one-line patch
> and I can write some more.
> 
>  drivers/infiniband/core/umem.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 71f0c0f..fb5abd3 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>  		ret = get_user_pages(current, current->mm, cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     1, !umem->writable, page_list, vma_list);
> +				     umem->writable, 0, page_list, vma_list);
>  
>  		if (ret < 0)
>  			goto out;
> -- 
> 1.7.8.3
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-26 20:01 ` Hugh Dickins
@ 2012-01-26 22:45   ` Roland Dreier
  2012-01-27 17:28     ` Roland Dreier
  2012-01-28  2:19     ` Hugh Dickins
  0 siblings, 2 replies; 17+ messages in thread
From: Roland Dreier @ 2012-01-26 22:45 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

Hi Hugh,

Thanks for the thoughtful answer...

> I think this is largely about the ZERO_PAGE.  If you do a read fault
> on an untouched anonymous area, it maps in the ZERO_PAGE, and will
> only give you your own private zeroed page when there's a write fault
> to touch it.
>
> I think your ib_umem_get() is making sure to give the process its own
> private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace
> will not be wanting to write into it, but presumably it expects to see
> data placed in that page by the underlying driver, and it would be very
> bad news if the driver wrote its data into the ZERO_PAGE.

I think we are actually OK.  If umem->writable == 0, that is actually
a promise by the driver/HW that they are not going to write to this
memory.  Mapping ZERO_PAGE to the hardware is fine in this case,
since the hardware will just read zeroes exactly as it should.

One question is whether we're OK if userspace maps some
anonymous memory with PROT_WRITE but doesn't touch it,
and then tries to map it to the hardware read-only.  In that case
we hit get_user_pages() with write == 0.  If I understand the code
correctly, we end up mapping ZERO_PAGE in do_anonymous_page().

But then if userspace writes to this anonymous memory, a COW
will be triggered and the hardware will be left holding a different
page than the one that is mapped into userspace (ie the device
won't read what userspace writes).  Kind of the inverse of the
problem I hit.

I don't have a good understanding of what force == 1 means -- I
guess the question is what happens if userspace tells us to
write to a read-only mapping that userspace could have mapped
writable?

> And although the ZERO_PAGE gives the most vivid example, I think it goes
> beyond that to the wider issue of pages COWed after fork() - GUPping in
> this way fixes the easy cases (and I've no desire again to go down that
> rathole of fixing the most general cases).

For IB / RDMA we kind of explicitly give up on COW after fork().
But I guess I don't know what issue you're thinking of.  Is it
similar to what I described above?  In other words, we have
a readable mapping that we'll COW on a write fault, but the driver
is only following the mapping for reading and so a COW will
mess us up.

Sigh, what a mess ... it seems what we really want to do is know
if userspace might trigger a COW because or not, and only do a
preemptive COW in that case.  (We're not really concerned with
userspace fork()ing and setting up a COW in the future, since that's
what we have MADV_DONTFORK for)

The status quo works for userspace anonymous mappings but
it doesn't work for my case of mapping a kernel buffer read-only
into userspace.  And fixing my case breaks the anonymous case.
Do you see a way out of this dilemma?  Do we need to add yet
another flag to get_user_pages()?

Thanks!
  Roland

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-26 22:45   ` Roland Dreier
@ 2012-01-27 17:28     ` Roland Dreier
  2012-01-28  2:31       ` Hugh Dickins
  2012-01-28  2:19     ` Hugh Dickins
  1 sibling, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2012-01-27 17:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

> Sigh, what a mess ... it seems what we really want to do is know
> if userspace might trigger a COW because or not, and only do a
> preemptive COW in that case.  (We're not really concerned with
> userspace fork()ing and setting up a COW in the future, since that's
> what we have MADV_DONTFORK for)
>
> The status quo works for userspace anonymous mappings but
> it doesn't work for my case of mapping a kernel buffer read-only
> into userspace.  And fixing my case breaks the anonymous case.
> Do you see a way out of this dilemma?  Do we need to add yet
> another flag to get_user_pages()?

So thinking about this a bit more... it seems what we want is at least
to first order that we do the equivalent of write==1 exactly when the vma
for a mapping has VM_WRITE set (or is it VMA_MAYWRITE / force==1?
I don't quite understand the distinction between WRITE and MAYWRITE).

Right now, one call to get_user_pages() might involve more than one vma,
but we could simulate the above by doing find_vma() and making sure our
call to get_user_pages() goes one vma at a time.  Of course that would be
inefficient since get_user_pages() will redo the find_vma() internally, so it
would I guess make sense to add another FOLL_ flag to tell
get_user_pages() to do this?

Am I all wet, or am I becoming an MM hacker?

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-26 22:45   ` Roland Dreier
  2012-01-27 17:28     ` Roland Dreier
@ 2012-01-28  2:19     ` Hugh Dickins
  2012-01-30 19:16       ` Roland Dreier
  1 sibling, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-01-28  2:19 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7586 bytes --]

On Thu, 26 Jan 2012, Roland Dreier wrote:
> 
> Thanks for the thoughtful answer...

But I should have paid more attention to reading what you had written.
In particular to:

> > > This patch comes from me trying to do userspace RDMA on a memory
> > > region exported from a character driver and mapped with
> > > 
> > >     mmap(... PROT_READ, MAP_PRIVATE ...)

Why MAP_PRIVATE?  There you are explicitly asking for COW: okay,
you wouldn't normally expect any COW while it's just PROT_READ, but
once you bring GUP into the picture, with use of write and force,
then you are just begging for COW with that MAP_PRIVATE.  Please
change it to MAP_SHARED - any reason why not?

I feel you're trying to handle two very different cases (rdma into
user-supplied anonymous memory, and exporting driver memory to the
user) with the same set of args to get_user_pages().  In fact, I
don't even see why you need get_user_pages() at all when exporting
driver memory to the user.  Ah, perhaps you don't, but you do want
your standard access method (which already involves GUP) not to
mess up when applied to such a mapping - is that it?

> 
> > I think this is largely about the ZERO_PAGE.  If you do a read fault
> > on an untouched anonymous area, it maps in the ZERO_PAGE, and will
> > only give you your own private zeroed page when there's a write fault
> > to touch it.
> >
> > I think your ib_umem_get() is making sure to give the process its own
> > private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace
> > will not be wanting to write into it, but presumably it expects to see
> > data placed in that page by the underlying driver, and it would be very
> > bad news if the driver wrote its data into the ZERO_PAGE.
> 
> I think we are actually OK.  If umem->writable == 0, that is actually
> a promise by the driver/HW that they are not going to write to this
> memory.  Mapping ZERO_PAGE to the hardware is fine in this case,
> since the hardware will just read zeroes exactly as it should.

Right, I got your use of the write flag the wrong way round
(it represents the needs of the driver end, not of the user end),
but we arrive at much the same issue.

> 
> One question is whether we're OK if userspace maps some
> anonymous memory with PROT_WRITE but doesn't touch it,
> and then tries to map it to the hardware read-only.  In that case
> we hit get_user_pages() with write == 0.  If I understand the code
> correctly, we end up mapping ZERO_PAGE in do_anonymous_page().
> 
> But then if userspace writes to this anonymous memory, a COW
> will be triggered and the hardware will be left holding a different
> page than the one that is mapped into userspace (ie the device
> won't read what userspace writes).  Kind of the inverse of the
> problem I hit.

Yes.

> 
> I don't have a good understanding of what force == 1 means -- I
> guess the question is what happens if userspace tells us to
> write to a read-only mapping that userspace could have mapped
> writable?

The force flag allows GUP to override the current protections of the vma:
by default, GUP refuses to write to an area which was not mapped PROT_WRITE,
and refuses to read from an area which is mapped PROT_NONE; but sometimes
we want it to go ahead despite those current protections - in your case,
you want the driver to be able to write into an area which the user cannot.

I believe the force flag originated for ptrace, and in particular,
to allow a breakpoint to be written into an area of read-only text.

You ask in other mail about VM_MAYREAD and VM_MAYWRITE, their relation
to VM_READ and VM_WRITE: they show the permissions you are permitted to
add via mprotect.  So although you may not have VM_WRITE on an area at
present, if VM_MAYWRITE is set then you can later mprotect it PROT_WRITE.
And "force" takes those into consideration, refusing to write to an area
for which you don't have MAYWRITE permission.

But here it gets very weird.  Although write=1 force=1 checks MAYWRITE
(and you need to have opened the file for writing to have MAYWRITE on a
SHARED mmap), the fault which GUP write=1 force=1 generates on a currently
readonly shared mapping actually causes COW - you end up with an anonymous
page inside the shared mapping.  Nick and I hated that, and tried to argue
Linus out of it, but he dug his heels in; and realistically it's many years
too late to change.  I assume it originated as an additional layer of
protection, to prevent someone setting breakpoints, who happened to have
opened the object RW instead of RO, from corrupting their object file on
disk.  So although shared-write-force demands you have write permission
on the file, it ensures you don't write to it (unless VM_WRITE also set).

I don't like even mentioning that behaviour, it's aberrant, and everywhere
ignored; but I once did an audit, and didn't find anything so seriously
wrong as to merit complicating the code all over.

But I have to mention it to you now, because just as I suggest above that
your mmap should be MAP_SHARED not MAP_PRIVATE, I might want to suggest
that you should open the mmaped file RW; yet I fear that this aberrant
behaviour will defeat that anyway.  Something to try to see if it helps.

(In preparing to answer you, I got excited to find what looked like
an anomaly within the aberration: although you can fairly easily see
do_wp_page() falling through to COW in the case above, it looked to me
as if the "early C-O-W break" code in __do_fault() does not.  But in
practice it appears to, and eventually I realized that maybe_mkwrite
does not set the pte_write bit in this case, which causes GUP's next
follow_page to fail, so it goes to handle_mm_fault again, which this
time will go the do_wp_page way, implementing the behaviour described.)

> 
> > And although the ZERO_PAGE gives the most vivid example, I think it goes
> > beyond that to the wider issue of pages COWed after fork() - GUPping in
> > this way fixes the easy cases (and I've no desire again to go down that
> > rathole of fixing the most general cases).
> 
> For IB / RDMA we kind of explicitly give up on COW after fork().
> But I guess I don't know what issue you're thinking of.  Is it
> similar to what I described above?  In other words, we have
> a readable mapping that we'll COW on a write fault, but the driver
> is only following the mapping for reading and so a COW will
> mess us up.

That's it; but you say that you give up on COW after fork(), just
as I say that I don't want to go down that rathole again.  There are
others who disagree, but you and I are on the same page (haha) there.

> 
> Sigh, what a mess ... it seems what we really want to do is know
> if userspace might trigger a COW because or not, and only do a
> preemptive COW in that case.  (We're not really concerned with
> userspace fork()ing and setting up a COW in the future, since that's
> what we have MADV_DONTFORK for)
> 
> The status quo works for userspace anonymous mappings but
> it doesn't work for my case of mapping a kernel buffer read-only
> into userspace.  And fixing my case breaks the anonymous case.
> Do you see a way out of this dilemma?  Do we need to add yet
> another flag to get_user_pages()?

I would prefer to avoid that, but it's always a possibility.
Of course, by new flag to __get_user_pages rather than more args
to get_user_pages.  But right now I'm still unclear why you're
doing get_user_pages at all in the mapped device case.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-27 17:28     ` Roland Dreier
@ 2012-01-28  2:31       ` Hugh Dickins
  2012-01-28 19:25         ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-01-28  2:31 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2193 bytes --]

On Fri, 27 Jan 2012, Roland Dreier wrote:

> > Sigh, what a mess ... it seems what we really want to do is know
> > if userspace might trigger a COW because or not, and only do a
> > preemptive COW in that case.  (We're not really concerned with
> > userspace fork()ing and setting up a COW in the future, since that's
> > what we have MADV_DONTFORK for)
> >
> > The status quo works for userspace anonymous mappings but
> > it doesn't work for my case of mapping a kernel buffer read-only
> > into userspace.  And fixing my case breaks the anonymous case.
> > Do you see a way out of this dilemma?  Do we need to add yet
> > another flag to get_user_pages()?
> 
> So thinking about this a bit more... it seems what we want is at least
> to first order that we do the equivalent of write==1 exactly when the vma
> for a mapping has VM_WRITE set

My first impression is that that's not what you want at all: that will
not do a preliminary COW of anonymous page to be written into by the
driver when the user only wants VM_READ access.  But perhaps I'm
worrying about the second order while you're sticking to first order.

Or perhaps I'm misunderstanding the context in which you want to do
this: are you now accepting to do a different get_user_pages in the
anonymous and driver-memory cases, and this suggestion was for the
driver-memory case only?

> (or is it VMA_MAYWRITE / force==1?
> I don't quite understand the distinction between WRITE and MAYWRITE).

I may have told you more than you wanted to know in the other mail.

> 
> Right now, one call to get_user_pages() might involve more than one vma,
> but we could simulate the above by doing find_vma() and making sure our
> call to get_user_pages() goes one vma at a time.  Of course that would be
> inefficient since get_user_pages() will redo the find_vma() internally, so it
> would I guess make sense to add another FOLL_ flag to tell
> get_user_pages() to do this?

I cannot go further, without explanation for why you need get_user_pages
in the driver-memory case at all.

> 
> Am I all wet, or am I becoming an MM hacker?

Certainly more than I'll ever be an RDMA hacker,

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-28  2:31       ` Hugh Dickins
@ 2012-01-28 19:25         ` Jason Gunthorpe
  2012-01-30 19:19           ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2012-01-28 19:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Roland Dreier, linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, Jan 27, 2012 at 06:31:07PM -0800, Hugh Dickins wrote:

> My first impression is that that's not what you want at all: that will
> not do a preliminary COW of anonymous page to be written into by the
> driver when the user only wants VM_READ access.  But perhaps I'm
> worrying about the second order while you're sticking to first order.

IMHO, in this instance, the RDMA driver should not violate the
mprotect flags of the page, ie if you ask to register memory for
RDMA WRITE that the process cannot write to, that should be denied.

I know accessing system memory (eg obtained via mmap on
/sys/bus/pci/devices/0000:00:02.0/resource0) has been asked for in the
past, and IIRC, the problem was that some of the common code, (GUP?)
errored on these maps. I don't know if Roland's case is similar.

The main point (at least before) was to have a uniform userspace API
for memory registration that worked on any address range mapped into
the process no matter where it came from. Currently the API just calls
GUP unconditionally...

Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-28  2:19     ` Hugh Dickins
@ 2012-01-30 19:16       ` Roland Dreier
  2012-01-30 20:20         ` Andrea Arcangeli
  2012-01-30 20:34         ` Hugh Dickins
  0 siblings, 2 replies; 17+ messages in thread
From: Roland Dreier @ 2012-01-30 19:16 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Fri, Jan 27, 2012 at 6:19 PM, Hugh Dickins <hughd@google.com> wrote:
>> > > This patch comes from me trying to do userspace RDMA on a memory
>> > > region exported from a character driver and mapped with
>> > >
>> > >     mmap(... PROT_READ, MAP_PRIVATE ...)
>
> Why MAP_PRIVATE?  There you are explicitly asking for COW: okay,
> you wouldn't normally expect any COW while it's just PROT_READ, but
> once you bring GUP into the picture, with use of write and force,
> then you are just begging for COW with that MAP_PRIVATE.  Please
> change it to MAP_SHARED - any reason why not?

I have no idea of the history there... probably could be changed with
no problems.

However, get_user_pages has this comment:

 * @force:	whether to force write access even if user mapping is
 *		readonly. This will result in the page being COWed even
 *		in MAP_SHARED mappings. You do not want this.

but I don't see where in the code FOLL_FORCE does COW
for MAP_SHARED mappings.  But on the OTOH I don't see
why we set force in the first place.  Why wouldn't we respect
the user's mapping permissions.

> I feel you're trying to handle two very different cases (rdma into
> user-supplied anonymous memory, and exporting driver memory to the
> user) with the same set of args to get_user_pages().  In fact, I
> don't even see why you need get_user_pages() at all when exporting
> driver memory to the user.  Ah, perhaps you don't, but you do want
> your standard access method (which already involves GUP) not to
> mess up when applied to such a mapping - is that it?

Exactly.  Right now we have the libibverbs userspace API, which
basically lets userspace create an abstract "memory region" (MR)
that is then given to the RDMA hardware to do IO on.  Userspace does

    mr = ibv_reg_mr(..., buf, size, access_flags);

where access flags say whether we're going to let the hardware
read and/or write the memory.

Ideally userspace should not have to know where the memory
underlying its "buf" came from or what type of mapping it is.

Certainly there are still more unresolved issues around the case
where userspace wants to map, say, part of a GPUs PCI memory
(which won't have any underlying page structs at all), but I'm at
least hoping we can come up with a way to handle both anonymous
private maps (which will be COWed from the zero page when
the memory is touched for writing) and shared mappings of kernel
memory exported by a driver's mmap method.


So I guess I'm left thinking that it seems at least plausible that
what we want is a new FOLL_ flag for __get_user_pages() that triggers
COW exactly on the pages that userspace might trigger COW on,
and avoids COW otherwise -- ie do FOLL_WRITE exactly for the
pages that have VM_WRITE in their mapping.

I don't think we want to do the "force" semantics or deal with the
VM_MAYWRITE possiblity -- the access we give the hardware on
behalf of userspace should just match the access that userspace
actually has.  It seems that if we don't try to get pages for writing
when VM_WRITE isn't set, we don't need force anymore.

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-28 19:25         ` Jason Gunthorpe
@ 2012-01-30 19:19           ` Roland Dreier
  0 siblings, 0 replies; 17+ messages in thread
From: Roland Dreier @ 2012-01-30 19:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hugh Dickins, linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Sat, Jan 28, 2012 at 11:25 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> I know accessing system memory (eg obtained via mmap on
> /sys/bus/pci/devices/0000:00:02.0/resource0) has been asked for in the
> past, and IIRC, the problem was that some of the common code, (GUP?)
> errored on these maps. I don't know if Roland's case is similar.

I think the problem there is that this is done via remap_pfn_range()
or similar, and the mapping has no underlying pages at all.  So we
would need a new interface that gives us different information for
such cases.

This is quite a bit trickier since I don't think the DMA API even has
a way to express getting a "device A" bus address for some memory
that is in a BAR for "device B".  So I'm not trying to address this case
(yet).  First I'd like to deal with as many flavors of page-backed
mappings as I can.

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-30 19:16       ` Roland Dreier
@ 2012-01-30 20:20         ` Andrea Arcangeli
  2012-02-06 17:46           ` Roland Dreier
  2012-01-30 20:34         ` Hugh Dickins
  1 sibling, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2012-01-30 20:20 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Hugh Dickins, linux-rdma, linux-mm, linux-kernel

Hi Roland,

On Mon, Jan 30, 2012 at 11:16:04AM -0800, Roland Dreier wrote:
> I have no idea of the history there... probably could be changed with
> no problems.
> 
> However, get_user_pages has this comment:
> 
>  * @force:	whether to force write access even if user mapping is
>  *		readonly. This will result in the page being COWed even
>  *		in MAP_SHARED mappings. You do not want this.
> 
> but I don't see where in the code FOLL_FORCE does COW
> for MAP_SHARED mappings.  But on the OTOH I don't see
> why we set force in the first place.  Why wouldn't we respect
> the user's mapping permissions.

I think it does, because pte_write cannot set if VM_WRITE is not set,
so then do_wp_page will be called if only VM_MAYWRITE is set.

force should only be needed by gdb, to modify .text, nothing else.

But for your usage, or anything that doesn't need to go around
userland "robusteness" permission for debugging purposes, it shouldn't
be needed.

> > I feel you're trying to handle two very different cases (rdma into
> > user-supplied anonymous memory, and exporting driver memory to the
> > user) with the same set of args to get_user_pages().  In fact, I
> > don't even see why you need get_user_pages() at all when exporting
> > driver memory to the user.  Ah, perhaps you don't, but you do want
> > your standard access method (which already involves GUP) not to
> > mess up when applied to such a mapping - is that it?
> 
> Exactly.  Right now we have the libibverbs userspace API, which
> basically lets userspace create an abstract "memory region" (MR)
> that is then given to the RDMA hardware to do IO on.  Userspace does
> 
>     mr = ibv_reg_mr(..., buf, size, access_flags);
> 
> where access flags say whether we're going to let the hardware
> read and/or write the memory.
> 
> Ideally userspace should not have to know where the memory
> underlying its "buf" came from or what type of mapping it is.
> 
> Certainly there are still more unresolved issues around the case
> where userspace wants to map, say, part of a GPUs PCI memory
> (which won't have any underlying page structs at all), but I'm at
> least hoping we can come up with a way to handle both anonymous
> private maps (which will be COWed from the zero page when
> the memory is touched for writing) and shared mappings of kernel
> memory exported by a driver's mmap method.

If you map it with an mmap(PROT_READ|PROT_WRITE), force or not force
won't change a thing in terms of cows. Just make sure you map your
control memory right, then safely remove force=1 and you won't get the
control page cowed by mistake. Then if you map it with MAP_SHARED it
won't be mapped read-only by fork() (leading to either parent or child
losing the control on the device), Hugh already suggested you to use
MAP_SHARED instead of MAP_PRIVATE.

Also I'm assuming the control memory is in ram not mmio space because
gup shouldn't mess with mmio space (which should be mapped with
VM_PFN_MAP and VM_IO in fact, to allow gup abort before it touches
anything).

> So I guess I'm left thinking that it seems at least plausible that
> what we want is a new FOLL_ flag for __get_user_pages() that triggers
> COW exactly on the pages that userspace might trigger COW on,
> and avoids COW otherwise -- ie do FOLL_WRITE exactly for the
> pages that have VM_WRITE in their mapping.

Well, that's the force=0 behavior, it should be like userland.

force bypass it and allows to write in a way that should instead
-EFAULT if it was userland writing to it (but for example in .text, it
doesn't overwrite pagecache of the binary, or it'd screw every other
instance of the executable, but it cows it so it then can modify .text
for gdb).

If you don't use force, the cowing behavior should be identical to
userland (except you'll get -EFAULT as retval, instead of
of a sigsegv raised).

> I don't think we want to do the "force" semantics or deal with the
> VM_MAYWRITE possiblity -- the access we give the hardware on
> behalf of userspace should just match the access that userspace
> actually has.  It seems that if we don't try to get pages for writing
> when VM_WRITE isn't set, we don't need force anymore.

Agreed. I guess you just need to map the control buffer in a way that
if userland writes to it, won't cow it? So then if gup tries to write
to it, it won't cow it? Still it sounds weird that gup writes to the
control buffer of the card, maybe you should do a check on the pages
returned by gup and verify they're not the control buffer before
allowing DMA to those?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-30 19:16       ` Roland Dreier
  2012-01-30 20:20         ` Andrea Arcangeli
@ 2012-01-30 20:34         ` Hugh Dickins
  2012-02-06 17:39           ` Roland Dreier
  1 sibling, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-01-30 20:34 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4318 bytes --]

On Mon, 30 Jan 2012, Roland Dreier wrote:
> On Fri, Jan 27, 2012 at 6:19 PM, Hugh Dickins <hughd@google.com> wrote:
> >> > > This patch comes from me trying to do userspace RDMA on a memory
> >> > > region exported from a character driver and mapped with
> >> > >
> >> > >     mmap(... PROT_READ, MAP_PRIVATE ...)
> >
> > Why MAP_PRIVATE?  There you are explicitly asking for COW: okay,
> > you wouldn't normally expect any COW while it's just PROT_READ, but
> > once you bring GUP into the picture, with use of write and force,
> > then you are just begging for COW with that MAP_PRIVATE.  Please
> > change it to MAP_SHARED - any reason why not?
> 
> I have no idea of the history there... probably could be changed with
> no problems.
> 
> However, get_user_pages has this comment:
> 
>  * @force:	whether to force write access even if user mapping is
>  *		readonly. This will result in the page being COWed even
>  *		in MAP_SHARED mappings. You do not want this.
> 
> but I don't see where in the code FOLL_FORCE does COW
> for MAP_SHARED mappings.

It ends up being implemented by the conditionals do_wp_page().

> But on the OTOH I don't see
> why we set force in the first place.  Why wouldn't we respect
> the user's mapping permissions.
> 
> > I feel you're trying to handle two very different cases (rdma into
> > user-supplied anonymous memory, and exporting driver memory to the
> > user) with the same set of args to get_user_pages().  In fact, I
> > don't even see why you need get_user_pages() at all when exporting
> > driver memory to the user.  Ah, perhaps you don't, but you do want
> > your standard access method (which already involves GUP) not to
> > mess up when applied to such a mapping - is that it?
> 
> Exactly.  Right now we have the libibverbs userspace API, which
> basically lets userspace create an abstract "memory region" (MR)
> that is then given to the RDMA hardware to do IO on.  Userspace does
> 
>     mr = ibv_reg_mr(..., buf, size, access_flags);
> 
> where access flags say whether we're going to let the hardware
> read and/or write the memory.
> 
> Ideally userspace should not have to know where the memory
> underlying its "buf" came from or what type of mapping it is.
> 
> Certainly there are still more unresolved issues around the case
> where userspace wants to map, say, part of a GPUs PCI memory
> (which won't have any underlying page structs at all), but I'm at
> least hoping we can come up with a way to handle both anonymous
> private maps (which will be COWed from the zero page when
> the memory is touched for writing) and shared mappings of kernel
> memory exported by a driver's mmap method.
> 
> 
> So I guess I'm left thinking that it seems at least plausible that
> what we want is a new FOLL_ flag for __get_user_pages() that triggers
> COW exactly on the pages that userspace might trigger COW on,
> and avoids COW otherwise -- ie do FOLL_WRITE exactly for the
> pages that have VM_WRITE in their mapping.

The hardest part about implementing that is deciding what snappy
name to give the FOLL_ flag.

And it's a comprehensible semantic which sounds good and plausible,
but I'm afraid we might be deceiving ourselves.

> 
> I don't think we want to do the "force" semantics or deal with the
> VM_MAYWRITE possiblity -- the access we give the hardware on
> behalf of userspace should just match the access that userspace
> actually has.  It seems that if we don't try to get pages for writing
> when VM_WRITE isn't set, we don't need force anymore.

I suspect you never needed or wanted the weird force behaviour on
shared maywrite, but that you did need the force COW behaviour on
private currently-unwritable maywrite.  You (or your forebears)
defined that interface to use the force flag, I'm guessing it was
for a reason; now you want to change it not to use the force flag,
and it sounds good, but I'm afraid you'll discover down the line
what the force flag was for.

Can you, for example, enforce the permissions set up by the user?
I mean, if they do the ibv_reg_mr() on a private readonly area,
so __get_user_pages with the FOLL_APPROPRIATELY flag will fault
in ZERO_PAGEs, can you enforce that RDMA will never spray data
into those pages?

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-30 20:34         ` Hugh Dickins
@ 2012-02-06 17:39           ` Roland Dreier
  2012-02-07 20:39             ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2012-02-06 17:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

Sorry for the slow reply, I got caught in other business...

On Mon, Jan 30, 2012 at 12:34 PM, Hugh Dickins <hughd@google.com> wrote:
> The hardest part about implementing that is deciding what snappy
> name to give the FOLL_ flag.

Yes... FOLL_SOFT_COW ?  FOLL_READONLY_COW ?
(plus a good comment explaining it I guess)

>> I don't think we want to do the "force" semantics or deal with the
>> VM_MAYWRITE possiblity -- the access we give the hardware on
>> behalf of userspace should just match the access that userspace
>> actually has.  It seems that if we don't try to get pages for writing
>> when VM_WRITE isn't set, we don't need force anymore.
>
> I suspect you never needed or wanted the weird force behaviour on
> shared maywrite, but that you did need the force COW behaviour on
> private currently-unwritable maywrite.  You (or your forebears)
> defined that interface to use the force flag, I'm guessing it was
> for a reason; now you want to change it not to use the force flag,
> and it sounds good, but I'm afraid you'll discover down the line
> what the force flag was for.

Actually I think I understand why the original code passed !write
as the force parameter.

If the user is registering memory with read-only access, there are
two common cases.  Possibly the underlying memory really has
a read-only mapping, but probably more often it is just an ordinary
buffer allocated in userspace with malloc() or the like.

In the second case, it's quite likely we have a read/write mapping
of anonymous pages.  We'll expose it read-only for RDMA but the
userspace process will write data into the memory via ordinary CPU
access.  However, if we do ibv_reg_mr() before initializing the memory
it's quite possible that the mapping actually points to the zero page,
waiting for a CPU write to trigger a COW.

So in the second case, doing GUP without the write flag will leave
the COW untriggered, and we'll end up mapping the zero page to
the hardware, and RDMA won't read the data that userspace actually
writes.  So (without GUP extension as we're discussing in this thread)
we're forced to pass write==1 to GUP, even if we expect hardware
to only do reads.

But if we pass write==1, then GUP on the first case (mapping that
is genuinely read-only) will fail, unless we pass force==1 too.  But
this should only succeed if we're going to only access the memory
read-only, so we should set force to !writable-access-by-rdma.

Which I think explains why the code is the way it is.  But clearly
we could do better if we had a better way of telling GUP our real
intentions -- ie the FOLL_READONLY_COW flag.

> Can you, for example, enforce the permissions set up by the user?
> I mean, if they do the ibv_reg_mr() on a private readonly area,
> so __get_user_pages with the FOLL_APPROPRIATELY flag will fault
> in ZERO_PAGEs, can you enforce that RDMA will never spray data
> into those pages?

Yes, the access flags passed into ibv_reg_mr() are enforced by
the RDMA hardware, so if no write access is request, no write
access is possible.

And presumably if we do GUP with write==1, force==0 that will
fail on a read-only mapping?

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-01-30 20:20         ` Andrea Arcangeli
@ 2012-02-06 17:46           ` Roland Dreier
  0 siblings, 0 replies; 17+ messages in thread
From: Roland Dreier @ 2012-02-06 17:46 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Hugh Dickins, linux-rdma, linux-mm, linux-kernel

Hi Andrea, sorry for the slow reply, had to work on other stuff for a bit.

On Mon, Jan 30, 2012 at 12:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> If you map it with an mmap(PROT_READ|PROT_WRITE), force or not force
> won't change a thing in terms of cows. Just make sure you map your
> control memory right, then safely remove force=1 and you won't get the
> control page cowed by mistake. Then if you map it with MAP_SHARED it
> won't be mapped read-only by fork() (leading to either parent or child
> losing the control on the device), Hugh already suggested you to use
> MAP_SHARED instead of MAP_PRIVATE.

Actually this isn't about control memory for the RDMA adapter...
as you mentioned that typically is MMIO and mapped with remap_pfn
stuff, without using any GUP stuff.

I'm talking about the registration of other memory for reading/writing
by a remote system via RDMA.

The reason I'm talking about exporting kernel memory is that I wanted
to do a debugging trick where a kernel module exposed some state
into an mmap'able buffer.  And I wanted to be able to read that state
even if my broken module killed the whole system (in fact exactly
when things crash I want to be able to read the state to figure out
why I crashed!).

So I wrote a trivial userspace program that does nothing but mmap
the buffer, accept RDMA connections from remote systems, and
map the buffer for reading over those connections.  Then I can have
a second system that connects to that process and polls the buffer.

Because all the RDMA state is setup in advance, I can keep polling
even after the first system panics.  It's sort of like that firewire remote
debugging, except I only get access to a limited memory buffer.

The only difficulty is the problem that started this thread, ie a bogus
COW so the remote system ends up polling the wrong pages.  So with
my original patch, I'm able to debug but I guess we agree it's the
wrong fix for the general problem, and I'll write up a patch that adds
what I think is the correct fix (the new FOLL flag) soon.

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-02-06 17:39           ` Roland Dreier
@ 2012-02-07 20:39             ` Hugh Dickins
  2012-02-08 23:10               ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-02-07 20:39 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3859 bytes --]

On Mon, 6 Feb 2012, Roland Dreier wrote:
> Sorry for the slow reply, I got caught in other business...

Negative problem!

> 
> On Mon, Jan 30, 2012 at 12:34 PM, Hugh Dickins <hughd@google.com> wrote:
> > The hardest part about implementing that is deciding what snappy
> > name to give the FOLL_ flag.
> 
> Yes... FOLL_SOFT_COW ?  FOLL_READONLY_COW ?
> (plus a good comment explaining it I guess)

I don't grok either of those - surely not READONLY_COW.

FOLL_PREPARE is the closest I've got.  I'd have said FOLL_TOUCH
if that weren't already taken.  FOLL_WRITE_IF_ABLE?

> 
> >> I don't think we want to do the "force" semantics or deal with the
> >> VM_MAYWRITE possiblity -- the access we give the hardware on
> >> behalf of userspace should just match the access that userspace
> >> actually has.  It seems that if we don't try to get pages for writing
> >> when VM_WRITE isn't set, we don't need force anymore.
> >
> > I suspect you never needed or wanted the weird force behaviour on
> > shared maywrite, but that you did need the force COW behaviour on
> > private currently-unwritable maywrite.  You (or your forebears)
> > defined that interface to use the force flag, I'm guessing it was
> > for a reason; now you want to change it not to use the force flag,
> > and it sounds good, but I'm afraid you'll discover down the line
> > what the force flag was for.
> 
> Actually I think I understand why the original code passed !write
> as the force parameter.
> 
> If the user is registering memory with read-only access, there are
> two common cases.  Possibly the underlying memory really has
> a read-only mapping, but probably more often it is just an ordinary
> buffer allocated in userspace with malloc() or the like.
> 
> In the second case, it's quite likely we have a read/write mapping
> of anonymous pages.  We'll expose it read-only for RDMA but the
> userspace process will write data into the memory via ordinary CPU
> access.  However, if we do ibv_reg_mr() before initializing the memory
> it's quite possible that the mapping actually points to the zero page,
> waiting for a CPU write to trigger a COW.
> 
> So in the second case, doing GUP without the write flag will leave
> the COW untriggered, and we'll end up mapping the zero page to
> the hardware, and RDMA won't read the data that userspace actually
> writes.  So (without GUP extension as we're discussing in this thread)
> we're forced to pass write==1 to GUP, even if we expect hardware
> to only do reads.
> 
> But if we pass write==1, then GUP on the first case (mapping that
> is genuinely read-only) will fail, unless we pass force==1 too.  But
> this should only succeed if we're going to only access the memory
> read-only, so we should set force to !writable-access-by-rdma.
> 
> Which I think explains why the code is the way it is.  But clearly
> we could do better if we had a better way of telling GUP our real
> intentions -- ie the FOLL_READONLY_COW flag.

You've persuaded me.  Yes, you have been using force because that was
the only tool available at the time, to get close to the sensible
behaviour you are now asking for.

> 
> > Can you, for example, enforce the permissions set up by the user?
> > I mean, if they do the ibv_reg_mr() on a private readonly area,
> > so __get_user_pages with the FOLL_APPROPRIATELY flag will fault
> > in ZERO_PAGEs, can you enforce that RDMA will never spray data
> > into those pages?
> 
> Yes, the access flags passed into ibv_reg_mr() are enforced by
> the RDMA hardware, so if no write access is request, no write
> access is possible.

Okay, if you enforce the agreed permissions in hardware, that's fine.

> 
> And presumably if we do GUP with write==1, force==0 that will
> fail on a read-only mapping?

That's right.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-02-07 20:39             ` Hugh Dickins
@ 2012-02-08 23:10               ` Hugh Dickins
  2012-02-09 17:50                 ` Roland Dreier
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2012-02-08 23:10 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Tue, 7 Feb 2012, Hugh Dickins wrote:
> On Mon, 6 Feb 2012, Roland Dreier wrote:
> > Which I think explains why the code is the way it is.  But clearly
> > we could do better if we had a better way of telling GUP our real
> > intentions -- ie the FOLL_READONLY_COW flag.
> 
> You've persuaded me.  Yes, you have been using force because that was
> the only tool available at the time, to get close to the sensible
> behaviour you are now asking for.
> 
> > 
> > > Can you, for example, enforce the permissions set up by the user?
> > > I mean, if they do the ibv_reg_mr() on a private readonly area,
> > > so __get_user_pages with the FOLL_APPROPRIATELY flag will fault
> > > in ZERO_PAGEs, can you enforce that RDMA will never spray data
> > > into those pages?
> > 
> > Yes, the access flags passed into ibv_reg_mr() are enforced by
> > the RDMA hardware, so if no write access is request, no write
> > access is possible.
> 
> Okay, if you enforce the agreed permissions in hardware, that's fine.

A doubt assaulted me overnight: sorry, I'm back to not understanding.

What are these access flags passed into ibv_reg_mr() that are enforced?
What relation do they bear to what you will pass to __get_user_pages()?

You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag,
which automatically works for read-write access to a VM_READ|VM_WRITE vma,
but read-only access to a VM_READ-only vma, without you having to know
which permission applies to which range of memory in the area specified.

But you don't need that new flag to set up read-only access, and if you
use that new flag to set up read-write access to an area which happens to
contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs.

?Hugh?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-02-08 23:10               ` Hugh Dickins
@ 2012-02-09 17:50                 ` Roland Dreier
  2012-02-09 22:57                   ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Roland Dreier @ 2012-02-09 17:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Wed, Feb 8, 2012 at 3:10 PM, Hugh Dickins <hughd@google.com> wrote:
> A doubt assaulted me overnight: sorry, I'm back to not understanding.
>
> What are these access flags passed into ibv_reg_mr() that are enforced?
> What relation do they bear to what you will pass to __get_user_pages()?

The access flags are:

enum ibv_access_flags {
        IBV_ACCESS_LOCAL_WRITE          = 1,
        IBV_ACCESS_REMOTE_WRITE         = (1<<1),
        IBV_ACCESS_REMOTE_READ          = (1<<2),
        IBV_ACCESS_REMOTE_ATOMIC        = (1<<3),
        IBV_ACCESS_MW_BIND              = (1<<4)
};

pretty much the only one of interest is IBV_ACCESS_REMOTE_READ --
all the others imply the possibility of RDMA HW writing to the page.

So basically if any flags other than IBV_ACCESS_REMOTE_READ are
set, we pass FOLL_WRITE to __get_user_pages(), otherwise we pass
the new FOLL_FOLLOW.  [does "Marcia, Marcia, Marcia" mean anything
to a Brit? ;)]

ie the change from the status quo would be:

[read-only]  write=1, force=1 --> FOLL_FOLLOW
[writeable]  wrote=1, force=0 --> FOLL_WRITE (equivalent)

> You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag,
> which automatically works for read-write access to a VM_READ|VM_WRITE vma,
> but read-only access to a VM_READ-only vma, without you having to know
> which permission applies to which range of memory in the area specified.

> But you don't need that new flag to set up read-only access, and if you
> use that new flag to set up read-write access to an area which happens to
> contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs.

First of all, I kind of like FOLL_FOLLOW as the name :)

Now you're confusing me: I think we do need FOLL_FOLLOW to
set up read-only access -- we want to trigger the COWs that userspace
might trigger by touching the memory up front.  This is to handle
a case like

    [userspace]
    int *buf = malloc(16 * 4096);
    // buf now points to 16 anonymous zero_pages
    mr = ibv_reg_mr(pd, buf, 16 * 4096, IBV_ACCESS_REMOTE_READ);
    // RDMA HW will only ever read buf, but...
    buf[0] = 2012;
    // COW triggered, first page of buf changed, RDMA HW has wrong mapping!

For something the RDMA HW might write to, then I agree we don't want
FOLL_FOLLOW -- we just would use FOLL_WRITE as we currently do.

When I get around to coding this up, I think I'm going to spend a lot
of time on the comments and on the commit log :)

 - R.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters
  2012-02-09 17:50                 ` Roland Dreier
@ 2012-02-09 22:57                   ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2012-02-09 22:57 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel

On Thu, 9 Feb 2012, Roland Dreier wrote:
> On Wed, Feb 8, 2012 at 3:10 PM, Hugh Dickins <hughd@google.com> wrote:
> > A doubt assaulted me overnight: sorry, I'm back to not understanding.
> >
> > What are these access flags passed into ibv_reg_mr() that are enforced?
> > What relation do they bear to what you will pass to __get_user_pages()?
> 
> The access flags are:
> 
> enum ibv_access_flags {
>         IBV_ACCESS_LOCAL_WRITE          = 1,
>         IBV_ACCESS_REMOTE_WRITE         = (1<<1),
>         IBV_ACCESS_REMOTE_READ          = (1<<2),
>         IBV_ACCESS_REMOTE_ATOMIC        = (1<<3),
>         IBV_ACCESS_MW_BIND              = (1<<4)
> };
> 
> pretty much the only one of interest is IBV_ACCESS_REMOTE_READ --
> all the others imply the possibility of RDMA HW writing to the page.
> 
> So basically if any flags other than IBV_ACCESS_REMOTE_READ are
> set, we pass FOLL_WRITE to __get_user_pages(), otherwise we pass
> the new FOLL_FOLLOW.  [does "Marcia, Marcia, Marcia" mean anything
> to a Brit? ;)]

[ Nothing whatsoever - I needed to avoid saying "Zilch" there, didn't I?
- I had to look her up.  Not sure quite how she comes in here, if you're
implying that someone is perfect, I rather doubt you're thinking of me!
I was thrilled a year ago at last to discover who Virginia is,
celebrated in mm/memory.c and mm/page-writeback.c. ]

> 
> ie the change from the status quo would be:
> 
> [read-only]  write=1, force=1 --> FOLL_FOLLOW
> [writeable]  wrote=1, force=0 --> FOLL_WRITE (equivalent)
> 
> > You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag,
> > which automatically works for read-write access to a VM_READ|VM_WRITE vma,
> > but read-only access to a VM_READ-only vma, without you having to know
> > which permission applies to which range of memory in the area specified.
> 
> > But you don't need that new flag to set up read-only access, and if you
> > use that new flag to set up read-write access to an area which happens to
> > contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs.
> 
> First of all, I kind of like FOLL_FOLLOW as the name :)

Yeah, it's not too bad; though below I'm now wondering if it is appropriate.

> 
> Now you're confusing me:

I'm very glad to hear it, I feel less alone.

> I think we do need FOLL_FOLLOW to
> set up read-only access -- we want to trigger the COWs that userspace
> might trigger by touching the memory up front.  This is to handle
> a case like
> 
>     [userspace]
>     int *buf = malloc(16 * 4096);
>     // buf now points to 16 anonymous zero_pages
>     mr = ibv_reg_mr(pd, buf, 16 * 4096, IBV_ACCESS_REMOTE_READ);
>     // RDMA HW will only ever read buf, but...
>     buf[0] = 2012;
>     // COW triggered, first page of buf changed, RDMA HW has wrong mapping!
> 
> For something the RDMA HW might write to, then I agree we don't want
> FOLL_FOLLOW -- we just would use FOLL_WRITE as we currently do.

Ah, okay, something earlier in the thread had thrown me off that track,
I thought we were expecting the ibv_reg_mr to give the remote the same
permissions as the user had.  Or something, maybe I'm just making excuses
for being dense.

But then I wonder if FOLL_FOLLOW is actually the behaviour you need.

Imagine a PROT_READ MAP_PRIVATE area (just as in your original mail):
what if the user does mprotect PROT_READ|PROT_WRITE on that afterwards,
and then proceeds to touch it.  The old write=1 force=1 GUP would have
pre-COWed that and no problem, but FOLL_FOLLOW will not.

Maybe you can answer "don't do that"; but you do then appear to be
trading one kind of "don't do that" for another.  Maybe it depends on
what libraries might get up to: aren't there (debug? garbage collection?)
memalloc libraries which give out memory protected until you touch it?

Maybe you need FOLL_PRECOW, which does write=1 force=1 on the private
areas, but just faults in the shared areas (avoiding the bizarre forced
COW on shared areas).

> 
> When I get around to coding this up, I think I'm going to spend a lot
> of time on the comments and on the commit log :)

I am sorry to be driving you to such effort, honestly.

Hugh

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2012-02-09 22:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-26  5:59 [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters Roland Dreier
2012-01-26 20:01 ` Hugh Dickins
2012-01-26 22:45   ` Roland Dreier
2012-01-27 17:28     ` Roland Dreier
2012-01-28  2:31       ` Hugh Dickins
2012-01-28 19:25         ` Jason Gunthorpe
2012-01-30 19:19           ` Roland Dreier
2012-01-28  2:19     ` Hugh Dickins
2012-01-30 19:16       ` Roland Dreier
2012-01-30 20:20         ` Andrea Arcangeli
2012-02-06 17:46           ` Roland Dreier
2012-01-30 20:34         ` Hugh Dickins
2012-02-06 17:39           ` Roland Dreier
2012-02-07 20:39             ` Hugh Dickins
2012-02-08 23:10               ` Hugh Dickins
2012-02-09 17:50                 ` Roland Dreier
2012-02-09 22:57                   ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).