* [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters @ 2012-01-26 5:59 Roland Dreier 2012-01-26 20:01 ` Hugh Dickins 0 siblings, 1 reply; 17+ messages in thread From: Roland Dreier @ 2012-01-26 5:59 UTC (permalink / raw) To: linux-rdma; +Cc: Andrea Arcangeli, Hugh Dickins, linux-mm, linux-kernel From: Roland Dreier <roland@purestorage.com> Right now, we always pass write==1 to get_user_pages(), even when we only intend to read the memory. We pass force==1 if we're going for read-only and force==0 if we want writable. The reasoning behind this seems to be contained in this out-of-tree changelog from 2005: Always ask get_user_pages() for writable pages, but pass force=1 if the consumer has only asked for read-only pages. This fixes a problem registering memory that has just been allocated but not touched yet, while allowing registration of read-only memory to continue to work. However, I don't think the mm works like this today, and indeed GUP will fault in pages for an untouched read-only mapping just fine with write and force set to 0. In fact, always passing 1 for write causes problems with modern kernels, because we end up hitting the "early C-O-W break" case in __do_fault(), even for read-only mappings where this makes no sense. Signed-off-by: Roland Dreier <roland@purestorage.com> --- This patch comes from me trying to do userspace RDMA on a memory region exported from a character driver and mapped with mmap(... PROT_READ, MAP_PRIVATE ...) The character driver has a trivial mmap method that just sets vm_ops and and equally trivial fault method that essentially just does vmf->page = vmalloc_to_page(buf + (vmf->pgoff << PAGE_SHIFT)); ie the most elementary way to export a vmalloc'ed buffer to userspace. However, when I tried doing ibv_reg_mr(... IBV_ACCESS_REMOTE_READ ...) in userspace on that mmap region, I found that COW was happening and so neither userspace nor the registered memory ended up pointing at the kernel buffer anymore, exactly because of the COW in __do_fault() I mention in the changelog above. The patch below fixes my test case, and doesn't seem to break any of the ibverbs examples and other simple tests of userspace verbs that I tried. But that's far from an exhaustive test suite. I'd definitely appreciate comments from MM experts here, since I'm not positive of my understand of G-U-P and friends, and I don't want to apps because this is wrong in some special case I didn't try. Also testing from anyone with an RDMA app that does anything at all fancy with memory allocation or registration would be helpful. Thanks! PS Let me know if I didn't go on long enough about this one-line patch and I can write some more. drivers/infiniband/core/umem.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 71f0c0f..fb5abd3 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, ret = get_user_pages(current, current->mm, cur_base, min_t(unsigned long, npages, PAGE_SIZE / sizeof (struct page *)), - 1, !umem->writable, page_list, vma_list); + umem->writable, 0, page_list, vma_list); if (ret < 0) goto out; -- 1.7.8.3 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-26 5:59 [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters Roland Dreier @ 2012-01-26 20:01 ` Hugh Dickins 2012-01-26 22:45 ` Roland Dreier 0 siblings, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-01-26 20:01 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Wed, 25 Jan 2012, Roland Dreier wrote: > From: Roland Dreier <roland@purestorage.com> > > Right now, we always pass write==1 to get_user_pages(), even when we > only intend to read the memory. We pass force==1 if we're going for > read-only and force==0 if we want writable. The reasoning behind this > seems to be contained in this out-of-tree changelog from 2005: > > Always ask get_user_pages() for writable pages, but pass force=1 > if the consumer has only asked for read-only pages. This fixes a > problem registering memory that has just been allocated but not > touched yet, while allowing registration of read-only memory to > continue to work. > > However, I don't think the mm works like this today, and indeed GUP > will fault in pages for an untouched read-only mapping just fine with > write and force set to 0. In fact, always passing 1 for write causes > problems with modern kernels, because we end up hitting the "early > C-O-W break" case in __do_fault(), even for read-only mappings where > this makes no sense. > > Signed-off-by: Roland Dreier <roland@purestorage.com> I'd love to say go ahead with this, it looks so obviously correct... ... so obviously correct that it made me wonder why on earth the old code was writing for read, and how to make make sense of the paragraph you very helpfully quote above. I think this is largely about the ZERO_PAGE. If you do a read fault on an untouched anonymous area, it maps in the ZERO_PAGE, and will only give you your own private zeroed page when there's a write fault to touch it. I think your ib_umem_get() is making sure to give the process its own private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace will not be wanting to write into it, but presumably it expects to see data placed in that page by the underlying driver, and it would be very bad news if the driver wrote its data into the ZERO_PAGE. We did go through a phase (2.6.24 through 2.6.31) of not using the ZERO_PAGE, but that presented its own problems, so it was reinstated. So, ugly as it looks, I think you have to stick with write=1 force=!write. Does that make sense now? And although the ZERO_PAGE gives the most vivid example, I think it goes beyond that to the wider issue of pages COWed after fork() - GUPping in this way fixes the easy cases (and I've no desire again to go down that rathole of fixing the most general cases). Hugh > --- > This patch comes from me trying to do userspace RDMA on a memory > region exported from a character driver and mapped with > > mmap(... PROT_READ, MAP_PRIVATE ...) > > The character driver has a trivial mmap method that just sets vm_ops > and and equally trivial fault method that essentially just does > > vmf->page = vmalloc_to_page(buf + (vmf->pgoff << PAGE_SHIFT)); > > ie the most elementary way to export a vmalloc'ed buffer to userspace. > > However, when I tried doing > > ibv_reg_mr(... IBV_ACCESS_REMOTE_READ ...) > > in userspace on that mmap region, I found that COW was happening and > so neither userspace nor the registered memory ended up pointing at > the kernel buffer anymore, exactly because of the COW in __do_fault() > I mention in the changelog above. > > The patch below fixes my test case, and doesn't seem to break any of > the ibverbs examples and other simple tests of userspace verbs that I > tried. But that's far from an exhaustive test suite. > > I'd definitely appreciate comments from MM experts here, since I'm not > positive of my understand of G-U-P and friends, and I don't want to > apps because this is wrong in some special case I didn't try. > > Also testing from anyone with an RDMA app that does anything at all > fancy with memory allocation or registration would be helpful. > > Thanks! > > PS Let me know if I didn't go on long enough about this one-line patch > and I can write some more. > > drivers/infiniband/core/umem.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c > index 71f0c0f..fb5abd3 100644 > --- a/drivers/infiniband/core/umem.c > +++ b/drivers/infiniband/core/umem.c > @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, > ret = get_user_pages(current, current->mm, cur_base, > min_t(unsigned long, npages, > PAGE_SIZE / sizeof (struct page *)), > - 1, !umem->writable, page_list, vma_list); > + umem->writable, 0, page_list, vma_list); > > if (ret < 0) > goto out; > -- > 1.7.8.3 > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-26 20:01 ` Hugh Dickins @ 2012-01-26 22:45 ` Roland Dreier 2012-01-27 17:28 ` Roland Dreier 2012-01-28 2:19 ` Hugh Dickins 0 siblings, 2 replies; 17+ messages in thread From: Roland Dreier @ 2012-01-26 22:45 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel Hi Hugh, Thanks for the thoughtful answer... > I think this is largely about the ZERO_PAGE. If you do a read fault > on an untouched anonymous area, it maps in the ZERO_PAGE, and will > only give you your own private zeroed page when there's a write fault > to touch it. > > I think your ib_umem_get() is making sure to give the process its own > private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace > will not be wanting to write into it, but presumably it expects to see > data placed in that page by the underlying driver, and it would be very > bad news if the driver wrote its data into the ZERO_PAGE. I think we are actually OK. If umem->writable == 0, that is actually a promise by the driver/HW that they are not going to write to this memory. Mapping ZERO_PAGE to the hardware is fine in this case, since the hardware will just read zeroes exactly as it should. One question is whether we're OK if userspace maps some anonymous memory with PROT_WRITE but doesn't touch it, and then tries to map it to the hardware read-only. In that case we hit get_user_pages() with write == 0. If I understand the code correctly, we end up mapping ZERO_PAGE in do_anonymous_page(). But then if userspace writes to this anonymous memory, a COW will be triggered and the hardware will be left holding a different page than the one that is mapped into userspace (ie the device won't read what userspace writes). Kind of the inverse of the problem I hit. I don't have a good understanding of what force == 1 means -- I guess the question is what happens if userspace tells us to write to a read-only mapping that userspace could have mapped writable? > And although the ZERO_PAGE gives the most vivid example, I think it goes > beyond that to the wider issue of pages COWed after fork() - GUPping in > this way fixes the easy cases (and I've no desire again to go down that > rathole of fixing the most general cases). For IB / RDMA we kind of explicitly give up on COW after fork(). But I guess I don't know what issue you're thinking of. Is it similar to what I described above? In other words, we have a readable mapping that we'll COW on a write fault, but the driver is only following the mapping for reading and so a COW will mess us up. Sigh, what a mess ... it seems what we really want to do is know if userspace might trigger a COW because or not, and only do a preemptive COW in that case. (We're not really concerned with userspace fork()ing and setting up a COW in the future, since that's what we have MADV_DONTFORK for) The status quo works for userspace anonymous mappings but it doesn't work for my case of mapping a kernel buffer read-only into userspace. And fixing my case breaks the anonymous case. Do you see a way out of this dilemma? Do we need to add yet another flag to get_user_pages()? Thanks! Roland ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-26 22:45 ` Roland Dreier @ 2012-01-27 17:28 ` Roland Dreier 2012-01-28 2:31 ` Hugh Dickins 2012-01-28 2:19 ` Hugh Dickins 1 sibling, 1 reply; 17+ messages in thread From: Roland Dreier @ 2012-01-27 17:28 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel > Sigh, what a mess ... it seems what we really want to do is know > if userspace might trigger a COW because or not, and only do a > preemptive COW in that case. (We're not really concerned with > userspace fork()ing and setting up a COW in the future, since that's > what we have MADV_DONTFORK for) > > The status quo works for userspace anonymous mappings but > it doesn't work for my case of mapping a kernel buffer read-only > into userspace. And fixing my case breaks the anonymous case. > Do you see a way out of this dilemma? Do we need to add yet > another flag to get_user_pages()? So thinking about this a bit more... it seems what we want is at least to first order that we do the equivalent of write==1 exactly when the vma for a mapping has VM_WRITE set (or is it VMA_MAYWRITE / force==1? I don't quite understand the distinction between WRITE and MAYWRITE). Right now, one call to get_user_pages() might involve more than one vma, but we could simulate the above by doing find_vma() and making sure our call to get_user_pages() goes one vma at a time. Of course that would be inefficient since get_user_pages() will redo the find_vma() internally, so it would I guess make sense to add another FOLL_ flag to tell get_user_pages() to do this? Am I all wet, or am I becoming an MM hacker? Thanks, Roland ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-27 17:28 ` Roland Dreier @ 2012-01-28 2:31 ` Hugh Dickins 2012-01-28 19:25 ` Jason Gunthorpe 0 siblings, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-01-28 2:31 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 2193 bytes --] On Fri, 27 Jan 2012, Roland Dreier wrote: > > Sigh, what a mess ... it seems what we really want to do is know > > if userspace might trigger a COW because or not, and only do a > > preemptive COW in that case. (We're not really concerned with > > userspace fork()ing and setting up a COW in the future, since that's > > what we have MADV_DONTFORK for) > > > > The status quo works for userspace anonymous mappings but > > it doesn't work for my case of mapping a kernel buffer read-only > > into userspace. And fixing my case breaks the anonymous case. > > Do you see a way out of this dilemma? Do we need to add yet > > another flag to get_user_pages()? > > So thinking about this a bit more... it seems what we want is at least > to first order that we do the equivalent of write==1 exactly when the vma > for a mapping has VM_WRITE set My first impression is that that's not what you want at all: that will not do a preliminary COW of anonymous page to be written into by the driver when the user only wants VM_READ access. But perhaps I'm worrying about the second order while you're sticking to first order. Or perhaps I'm misunderstanding the context in which you want to do this: are you now accepting to do a different get_user_pages in the anonymous and driver-memory cases, and this suggestion was for the driver-memory case only? > (or is it VMA_MAYWRITE / force==1? > I don't quite understand the distinction between WRITE and MAYWRITE). I may have told you more than you wanted to know in the other mail. > > Right now, one call to get_user_pages() might involve more than one vma, > but we could simulate the above by doing find_vma() and making sure our > call to get_user_pages() goes one vma at a time. Of course that would be > inefficient since get_user_pages() will redo the find_vma() internally, so it > would I guess make sense to add another FOLL_ flag to tell > get_user_pages() to do this? I cannot go further, without explanation for why you need get_user_pages in the driver-memory case at all. > > Am I all wet, or am I becoming an MM hacker? Certainly more than I'll ever be an RDMA hacker, Hugh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-28 2:31 ` Hugh Dickins @ 2012-01-28 19:25 ` Jason Gunthorpe 2012-01-30 19:19 ` Roland Dreier 0 siblings, 1 reply; 17+ messages in thread From: Jason Gunthorpe @ 2012-01-28 19:25 UTC (permalink / raw) To: Hugh Dickins Cc: Roland Dreier, linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Fri, Jan 27, 2012 at 06:31:07PM -0800, Hugh Dickins wrote: > My first impression is that that's not what you want at all: that will > not do a preliminary COW of anonymous page to be written into by the > driver when the user only wants VM_READ access. But perhaps I'm > worrying about the second order while you're sticking to first order. IMHO, in this instance, the RDMA driver should not violate the mprotect flags of the page, ie if you ask to register memory for RDMA WRITE that the process cannot write to, that should be denied. I know accessing system memory (eg obtained via mmap on /sys/bus/pci/devices/0000:00:02.0/resource0) has been asked for in the past, and IIRC, the problem was that some of the common code, (GUP?) errored on these maps. I don't know if Roland's case is similar. The main point (at least before) was to have a uniform userspace API for memory registration that worked on any address range mapped into the process no matter where it came from. Currently the API just calls GUP unconditionally... Jason ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-28 19:25 ` Jason Gunthorpe @ 2012-01-30 19:19 ` Roland Dreier 0 siblings, 0 replies; 17+ messages in thread From: Roland Dreier @ 2012-01-30 19:19 UTC (permalink / raw) To: Jason Gunthorpe Cc: Hugh Dickins, linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Sat, Jan 28, 2012 at 11:25 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > I know accessing system memory (eg obtained via mmap on > /sys/bus/pci/devices/0000:00:02.0/resource0) has been asked for in the > past, and IIRC, the problem was that some of the common code, (GUP?) > errored on these maps. I don't know if Roland's case is similar. I think the problem there is that this is done via remap_pfn_range() or similar, and the mapping has no underlying pages at all. So we would need a new interface that gives us different information for such cases. This is quite a bit trickier since I don't think the DMA API even has a way to express getting a "device A" bus address for some memory that is in a BAR for "device B". So I'm not trying to address this case (yet). First I'd like to deal with as many flavors of page-backed mappings as I can. - R. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-26 22:45 ` Roland Dreier 2012-01-27 17:28 ` Roland Dreier @ 2012-01-28 2:19 ` Hugh Dickins 2012-01-30 19:16 ` Roland Dreier 1 sibling, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-01-28 2:19 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 7586 bytes --] On Thu, 26 Jan 2012, Roland Dreier wrote: > > Thanks for the thoughtful answer... But I should have paid more attention to reading what you had written. In particular to: > > > This patch comes from me trying to do userspace RDMA on a memory > > > region exported from a character driver and mapped with > > > > > > mmap(... PROT_READ, MAP_PRIVATE ...) Why MAP_PRIVATE? There you are explicitly asking for COW: okay, you wouldn't normally expect any COW while it's just PROT_READ, but once you bring GUP into the picture, with use of write and force, then you are just begging for COW with that MAP_PRIVATE. Please change it to MAP_SHARED - any reason why not? I feel you're trying to handle two very different cases (rdma into user-supplied anonymous memory, and exporting driver memory to the user) with the same set of args to get_user_pages(). In fact, I don't even see why you need get_user_pages() at all when exporting driver memory to the user. Ah, perhaps you don't, but you do want your standard access method (which already involves GUP) not to mess up when applied to such a mapping - is that it? > > > I think this is largely about the ZERO_PAGE. If you do a read fault > > on an untouched anonymous area, it maps in the ZERO_PAGE, and will > > only give you your own private zeroed page when there's a write fault > > to touch it. > > > > I think your ib_umem_get() is making sure to give the process its own > > private zeroed page: if the area is PROT_READ, MAP_PRIVATE, userspace > > will not be wanting to write into it, but presumably it expects to see > > data placed in that page by the underlying driver, and it would be very > > bad news if the driver wrote its data into the ZERO_PAGE. > > I think we are actually OK. If umem->writable == 0, that is actually > a promise by the driver/HW that they are not going to write to this > memory. Mapping ZERO_PAGE to the hardware is fine in this case, > since the hardware will just read zeroes exactly as it should. Right, I got your use of the write flag the wrong way round (it represents the needs of the driver end, not of the user end), but we arrive at much the same issue. > > One question is whether we're OK if userspace maps some > anonymous memory with PROT_WRITE but doesn't touch it, > and then tries to map it to the hardware read-only. In that case > we hit get_user_pages() with write == 0. If I understand the code > correctly, we end up mapping ZERO_PAGE in do_anonymous_page(). > > But then if userspace writes to this anonymous memory, a COW > will be triggered and the hardware will be left holding a different > page than the one that is mapped into userspace (ie the device > won't read what userspace writes). Kind of the inverse of the > problem I hit. Yes. > > I don't have a good understanding of what force == 1 means -- I > guess the question is what happens if userspace tells us to > write to a read-only mapping that userspace could have mapped > writable? The force flag allows GUP to override the current protections of the vma: by default, GUP refuses to write to an area which was not mapped PROT_WRITE, and refuses to read from an area which is mapped PROT_NONE; but sometimes we want it to go ahead despite those current protections - in your case, you want the driver to be able to write into an area which the user cannot. I believe the force flag originated for ptrace, and in particular, to allow a breakpoint to be written into an area of read-only text. You ask in other mail about VM_MAYREAD and VM_MAYWRITE, their relation to VM_READ and VM_WRITE: they show the permissions you are permitted to add via mprotect. So although you may not have VM_WRITE on an area at present, if VM_MAYWRITE is set then you can later mprotect it PROT_WRITE. And "force" takes those into consideration, refusing to write to an area for which you don't have MAYWRITE permission. But here it gets very weird. Although write=1 force=1 checks MAYWRITE (and you need to have opened the file for writing to have MAYWRITE on a SHARED mmap), the fault which GUP write=1 force=1 generates on a currently readonly shared mapping actually causes COW - you end up with an anonymous page inside the shared mapping. Nick and I hated that, and tried to argue Linus out of it, but he dug his heels in; and realistically it's many years too late to change. I assume it originated as an additional layer of protection, to prevent someone setting breakpoints, who happened to have opened the object RW instead of RO, from corrupting their object file on disk. So although shared-write-force demands you have write permission on the file, it ensures you don't write to it (unless VM_WRITE also set). I don't like even mentioning that behaviour, it's aberrant, and everywhere ignored; but I once did an audit, and didn't find anything so seriously wrong as to merit complicating the code all over. But I have to mention it to you now, because just as I suggest above that your mmap should be MAP_SHARED not MAP_PRIVATE, I might want to suggest that you should open the mmaped file RW; yet I fear that this aberrant behaviour will defeat that anyway. Something to try to see if it helps. (In preparing to answer you, I got excited to find what looked like an anomaly within the aberration: although you can fairly easily see do_wp_page() falling through to COW in the case above, it looked to me as if the "early C-O-W break" code in __do_fault() does not. But in practice it appears to, and eventually I realized that maybe_mkwrite does not set the pte_write bit in this case, which causes GUP's next follow_page to fail, so it goes to handle_mm_fault again, which this time will go the do_wp_page way, implementing the behaviour described.) > > > And although the ZERO_PAGE gives the most vivid example, I think it goes > > beyond that to the wider issue of pages COWed after fork() - GUPping in > > this way fixes the easy cases (and I've no desire again to go down that > > rathole of fixing the most general cases). > > For IB / RDMA we kind of explicitly give up on COW after fork(). > But I guess I don't know what issue you're thinking of. Is it > similar to what I described above? In other words, we have > a readable mapping that we'll COW on a write fault, but the driver > is only following the mapping for reading and so a COW will > mess us up. That's it; but you say that you give up on COW after fork(), just as I say that I don't want to go down that rathole again. There are others who disagree, but you and I are on the same page (haha) there. > > Sigh, what a mess ... it seems what we really want to do is know > if userspace might trigger a COW because or not, and only do a > preemptive COW in that case. (We're not really concerned with > userspace fork()ing and setting up a COW in the future, since that's > what we have MADV_DONTFORK for) > > The status quo works for userspace anonymous mappings but > it doesn't work for my case of mapping a kernel buffer read-only > into userspace. And fixing my case breaks the anonymous case. > Do you see a way out of this dilemma? Do we need to add yet > another flag to get_user_pages()? I would prefer to avoid that, but it's always a possibility. Of course, by new flag to __get_user_pages rather than more args to get_user_pages. But right now I'm still unclear why you're doing get_user_pages at all in the mapped device case. Hugh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-28 2:19 ` Hugh Dickins @ 2012-01-30 19:16 ` Roland Dreier 2012-01-30 20:20 ` Andrea Arcangeli 2012-01-30 20:34 ` Hugh Dickins 0 siblings, 2 replies; 17+ messages in thread From: Roland Dreier @ 2012-01-30 19:16 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Fri, Jan 27, 2012 at 6:19 PM, Hugh Dickins <hughd@google.com> wrote: >> > > This patch comes from me trying to do userspace RDMA on a memory >> > > region exported from a character driver and mapped with >> > > >> > > mmap(... PROT_READ, MAP_PRIVATE ...) > > Why MAP_PRIVATE? There you are explicitly asking for COW: okay, > you wouldn't normally expect any COW while it's just PROT_READ, but > once you bring GUP into the picture, with use of write and force, > then you are just begging for COW with that MAP_PRIVATE. Please > change it to MAP_SHARED - any reason why not? I have no idea of the history there... probably could be changed with no problems. However, get_user_pages has this comment: * @force: whether to force write access even if user mapping is * readonly. This will result in the page being COWed even * in MAP_SHARED mappings. You do not want this. but I don't see where in the code FOLL_FORCE does COW for MAP_SHARED mappings. But on the OTOH I don't see why we set force in the first place. Why wouldn't we respect the user's mapping permissions. > I feel you're trying to handle two very different cases (rdma into > user-supplied anonymous memory, and exporting driver memory to the > user) with the same set of args to get_user_pages(). In fact, I > don't even see why you need get_user_pages() at all when exporting > driver memory to the user. Ah, perhaps you don't, but you do want > your standard access method (which already involves GUP) not to > mess up when applied to such a mapping - is that it? Exactly. Right now we have the libibverbs userspace API, which basically lets userspace create an abstract "memory region" (MR) that is then given to the RDMA hardware to do IO on. Userspace does mr = ibv_reg_mr(..., buf, size, access_flags); where access flags say whether we're going to let the hardware read and/or write the memory. Ideally userspace should not have to know where the memory underlying its "buf" came from or what type of mapping it is. Certainly there are still more unresolved issues around the case where userspace wants to map, say, part of a GPUs PCI memory (which won't have any underlying page structs at all), but I'm at least hoping we can come up with a way to handle both anonymous private maps (which will be COWed from the zero page when the memory is touched for writing) and shared mappings of kernel memory exported by a driver's mmap method. So I guess I'm left thinking that it seems at least plausible that what we want is a new FOLL_ flag for __get_user_pages() that triggers COW exactly on the pages that userspace might trigger COW on, and avoids COW otherwise -- ie do FOLL_WRITE exactly for the pages that have VM_WRITE in their mapping. I don't think we want to do the "force" semantics or deal with the VM_MAYWRITE possiblity -- the access we give the hardware on behalf of userspace should just match the access that userspace actually has. It seems that if we don't try to get pages for writing when VM_WRITE isn't set, we don't need force anymore. - R. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-30 19:16 ` Roland Dreier @ 2012-01-30 20:20 ` Andrea Arcangeli 2012-02-06 17:46 ` Roland Dreier 2012-01-30 20:34 ` Hugh Dickins 1 sibling, 1 reply; 17+ messages in thread From: Andrea Arcangeli @ 2012-01-30 20:20 UTC (permalink / raw) To: Roland Dreier; +Cc: Hugh Dickins, linux-rdma, linux-mm, linux-kernel Hi Roland, On Mon, Jan 30, 2012 at 11:16:04AM -0800, Roland Dreier wrote: > I have no idea of the history there... probably could be changed with > no problems. > > However, get_user_pages has this comment: > > * @force: whether to force write access even if user mapping is > * readonly. This will result in the page being COWed even > * in MAP_SHARED mappings. You do not want this. > > but I don't see where in the code FOLL_FORCE does COW > for MAP_SHARED mappings. But on the OTOH I don't see > why we set force in the first place. Why wouldn't we respect > the user's mapping permissions. I think it does, because pte_write cannot set if VM_WRITE is not set, so then do_wp_page will be called if only VM_MAYWRITE is set. force should only be needed by gdb, to modify .text, nothing else. But for your usage, or anything that doesn't need to go around userland "robusteness" permission for debugging purposes, it shouldn't be needed. > > I feel you're trying to handle two very different cases (rdma into > > user-supplied anonymous memory, and exporting driver memory to the > > user) with the same set of args to get_user_pages(). In fact, I > > don't even see why you need get_user_pages() at all when exporting > > driver memory to the user. Ah, perhaps you don't, but you do want > > your standard access method (which already involves GUP) not to > > mess up when applied to such a mapping - is that it? > > Exactly. Right now we have the libibverbs userspace API, which > basically lets userspace create an abstract "memory region" (MR) > that is then given to the RDMA hardware to do IO on. Userspace does > > mr = ibv_reg_mr(..., buf, size, access_flags); > > where access flags say whether we're going to let the hardware > read and/or write the memory. > > Ideally userspace should not have to know where the memory > underlying its "buf" came from or what type of mapping it is. > > Certainly there are still more unresolved issues around the case > where userspace wants to map, say, part of a GPUs PCI memory > (which won't have any underlying page structs at all), but I'm at > least hoping we can come up with a way to handle both anonymous > private maps (which will be COWed from the zero page when > the memory is touched for writing) and shared mappings of kernel > memory exported by a driver's mmap method. If you map it with an mmap(PROT_READ|PROT_WRITE), force or not force won't change a thing in terms of cows. Just make sure you map your control memory right, then safely remove force=1 and you won't get the control page cowed by mistake. Then if you map it with MAP_SHARED it won't be mapped read-only by fork() (leading to either parent or child losing the control on the device), Hugh already suggested you to use MAP_SHARED instead of MAP_PRIVATE. Also I'm assuming the control memory is in ram not mmio space because gup shouldn't mess with mmio space (which should be mapped with VM_PFN_MAP and VM_IO in fact, to allow gup abort before it touches anything). > So I guess I'm left thinking that it seems at least plausible that > what we want is a new FOLL_ flag for __get_user_pages() that triggers > COW exactly on the pages that userspace might trigger COW on, > and avoids COW otherwise -- ie do FOLL_WRITE exactly for the > pages that have VM_WRITE in their mapping. Well, that's the force=0 behavior, it should be like userland. force bypass it and allows to write in a way that should instead -EFAULT if it was userland writing to it (but for example in .text, it doesn't overwrite pagecache of the binary, or it'd screw every other instance of the executable, but it cows it so it then can modify .text for gdb). If you don't use force, the cowing behavior should be identical to userland (except you'll get -EFAULT as retval, instead of of a sigsegv raised). > I don't think we want to do the "force" semantics or deal with the > VM_MAYWRITE possiblity -- the access we give the hardware on > behalf of userspace should just match the access that userspace > actually has. It seems that if we don't try to get pages for writing > when VM_WRITE isn't set, we don't need force anymore. Agreed. I guess you just need to map the control buffer in a way that if userland writes to it, won't cow it? So then if gup tries to write to it, it won't cow it? Still it sounds weird that gup writes to the control buffer of the card, maybe you should do a check on the pages returned by gup and verify they're not the control buffer before allowing DMA to those? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-30 20:20 ` Andrea Arcangeli @ 2012-02-06 17:46 ` Roland Dreier 0 siblings, 0 replies; 17+ messages in thread From: Roland Dreier @ 2012-02-06 17:46 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Hugh Dickins, linux-rdma, linux-mm, linux-kernel Hi Andrea, sorry for the slow reply, had to work on other stuff for a bit. On Mon, Jan 30, 2012 at 12:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > If you map it with an mmap(PROT_READ|PROT_WRITE), force or not force > won't change a thing in terms of cows. Just make sure you map your > control memory right, then safely remove force=1 and you won't get the > control page cowed by mistake. Then if you map it with MAP_SHARED it > won't be mapped read-only by fork() (leading to either parent or child > losing the control on the device), Hugh already suggested you to use > MAP_SHARED instead of MAP_PRIVATE. Actually this isn't about control memory for the RDMA adapter... as you mentioned that typically is MMIO and mapped with remap_pfn stuff, without using any GUP stuff. I'm talking about the registration of other memory for reading/writing by a remote system via RDMA. The reason I'm talking about exporting kernel memory is that I wanted to do a debugging trick where a kernel module exposed some state into an mmap'able buffer. And I wanted to be able to read that state even if my broken module killed the whole system (in fact exactly when things crash I want to be able to read the state to figure out why I crashed!). So I wrote a trivial userspace program that does nothing but mmap the buffer, accept RDMA connections from remote systems, and map the buffer for reading over those connections. Then I can have a second system that connects to that process and polls the buffer. Because all the RDMA state is setup in advance, I can keep polling even after the first system panics. It's sort of like that firewire remote debugging, except I only get access to a limited memory buffer. The only difficulty is the problem that started this thread, ie a bogus COW so the remote system ends up polling the wrong pages. So with my original patch, I'm able to debug but I guess we agree it's the wrong fix for the general problem, and I'll write up a patch that adds what I think is the correct fix (the new FOLL flag) soon. - R. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-30 19:16 ` Roland Dreier 2012-01-30 20:20 ` Andrea Arcangeli @ 2012-01-30 20:34 ` Hugh Dickins 2012-02-06 17:39 ` Roland Dreier 1 sibling, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-01-30 20:34 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 4318 bytes --] On Mon, 30 Jan 2012, Roland Dreier wrote: > On Fri, Jan 27, 2012 at 6:19 PM, Hugh Dickins <hughd@google.com> wrote: > >> > > This patch comes from me trying to do userspace RDMA on a memory > >> > > region exported from a character driver and mapped with > >> > > > >> > > mmap(... PROT_READ, MAP_PRIVATE ...) > > > > Why MAP_PRIVATE? There you are explicitly asking for COW: okay, > > you wouldn't normally expect any COW while it's just PROT_READ, but > > once you bring GUP into the picture, with use of write and force, > > then you are just begging for COW with that MAP_PRIVATE. Please > > change it to MAP_SHARED - any reason why not? > > I have no idea of the history there... probably could be changed with > no problems. > > However, get_user_pages has this comment: > > * @force: whether to force write access even if user mapping is > * readonly. This will result in the page being COWed even > * in MAP_SHARED mappings. You do not want this. > > but I don't see where in the code FOLL_FORCE does COW > for MAP_SHARED mappings. It ends up being implemented by the conditionals do_wp_page(). > But on the OTOH I don't see > why we set force in the first place. Why wouldn't we respect > the user's mapping permissions. > > > I feel you're trying to handle two very different cases (rdma into > > user-supplied anonymous memory, and exporting driver memory to the > > user) with the same set of args to get_user_pages(). In fact, I > > don't even see why you need get_user_pages() at all when exporting > > driver memory to the user. Ah, perhaps you don't, but you do want > > your standard access method (which already involves GUP) not to > > mess up when applied to such a mapping - is that it? > > Exactly. Right now we have the libibverbs userspace API, which > basically lets userspace create an abstract "memory region" (MR) > that is then given to the RDMA hardware to do IO on. Userspace does > > mr = ibv_reg_mr(..., buf, size, access_flags); > > where access flags say whether we're going to let the hardware > read and/or write the memory. > > Ideally userspace should not have to know where the memory > underlying its "buf" came from or what type of mapping it is. > > Certainly there are still more unresolved issues around the case > where userspace wants to map, say, part of a GPUs PCI memory > (which won't have any underlying page structs at all), but I'm at > least hoping we can come up with a way to handle both anonymous > private maps (which will be COWed from the zero page when > the memory is touched for writing) and shared mappings of kernel > memory exported by a driver's mmap method. > > > So I guess I'm left thinking that it seems at least plausible that > what we want is a new FOLL_ flag for __get_user_pages() that triggers > COW exactly on the pages that userspace might trigger COW on, > and avoids COW otherwise -- ie do FOLL_WRITE exactly for the > pages that have VM_WRITE in their mapping. The hardest part about implementing that is deciding what snappy name to give the FOLL_ flag. And it's a comprehensible semantic which sounds good and plausible, but I'm afraid we might be deceiving ourselves. > > I don't think we want to do the "force" semantics or deal with the > VM_MAYWRITE possiblity -- the access we give the hardware on > behalf of userspace should just match the access that userspace > actually has. It seems that if we don't try to get pages for writing > when VM_WRITE isn't set, we don't need force anymore. I suspect you never needed or wanted the weird force behaviour on shared maywrite, but that you did need the force COW behaviour on private currently-unwritable maywrite. You (or your forebears) defined that interface to use the force flag, I'm guessing it was for a reason; now you want to change it not to use the force flag, and it sounds good, but I'm afraid you'll discover down the line what the force flag was for. Can you, for example, enforce the permissions set up by the user? I mean, if they do the ibv_reg_mr() on a private readonly area, so __get_user_pages with the FOLL_APPROPRIATELY flag will fault in ZERO_PAGEs, can you enforce that RDMA will never spray data into those pages? Hugh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-01-30 20:34 ` Hugh Dickins @ 2012-02-06 17:39 ` Roland Dreier 2012-02-07 20:39 ` Hugh Dickins 0 siblings, 1 reply; 17+ messages in thread From: Roland Dreier @ 2012-02-06 17:39 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel Sorry for the slow reply, I got caught in other business... On Mon, Jan 30, 2012 at 12:34 PM, Hugh Dickins <hughd@google.com> wrote: > The hardest part about implementing that is deciding what snappy > name to give the FOLL_ flag. Yes... FOLL_SOFT_COW ? FOLL_READONLY_COW ? (plus a good comment explaining it I guess) >> I don't think we want to do the "force" semantics or deal with the >> VM_MAYWRITE possiblity -- the access we give the hardware on >> behalf of userspace should just match the access that userspace >> actually has. It seems that if we don't try to get pages for writing >> when VM_WRITE isn't set, we don't need force anymore. > > I suspect you never needed or wanted the weird force behaviour on > shared maywrite, but that you did need the force COW behaviour on > private currently-unwritable maywrite. You (or your forebears) > defined that interface to use the force flag, I'm guessing it was > for a reason; now you want to change it not to use the force flag, > and it sounds good, but I'm afraid you'll discover down the line > what the force flag was for. Actually I think I understand why the original code passed !write as the force parameter. If the user is registering memory with read-only access, there are two common cases. Possibly the underlying memory really has a read-only mapping, but probably more often it is just an ordinary buffer allocated in userspace with malloc() or the like. In the second case, it's quite likely we have a read/write mapping of anonymous pages. We'll expose it read-only for RDMA but the userspace process will write data into the memory via ordinary CPU access. However, if we do ibv_reg_mr() before initializing the memory it's quite possible that the mapping actually points to the zero page, waiting for a CPU write to trigger a COW. So in the second case, doing GUP without the write flag will leave the COW untriggered, and we'll end up mapping the zero page to the hardware, and RDMA won't read the data that userspace actually writes. So (without GUP extension as we're discussing in this thread) we're forced to pass write==1 to GUP, even if we expect hardware to only do reads. But if we pass write==1, then GUP on the first case (mapping that is genuinely read-only) will fail, unless we pass force==1 too. But this should only succeed if we're going to only access the memory read-only, so we should set force to !writable-access-by-rdma. Which I think explains why the code is the way it is. But clearly we could do better if we had a better way of telling GUP our real intentions -- ie the FOLL_READONLY_COW flag. > Can you, for example, enforce the permissions set up by the user? > I mean, if they do the ibv_reg_mr() on a private readonly area, > so __get_user_pages with the FOLL_APPROPRIATELY flag will fault > in ZERO_PAGEs, can you enforce that RDMA will never spray data > into those pages? Yes, the access flags passed into ibv_reg_mr() are enforced by the RDMA hardware, so if no write access is request, no write access is possible. And presumably if we do GUP with write==1, force==0 that will fail on a read-only mapping? - R. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-02-06 17:39 ` Roland Dreier @ 2012-02-07 20:39 ` Hugh Dickins 2012-02-08 23:10 ` Hugh Dickins 0 siblings, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-02-07 20:39 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel [-- Attachment #1: Type: TEXT/PLAIN, Size: 3859 bytes --] On Mon, 6 Feb 2012, Roland Dreier wrote: > Sorry for the slow reply, I got caught in other business... Negative problem! > > On Mon, Jan 30, 2012 at 12:34 PM, Hugh Dickins <hughd@google.com> wrote: > > The hardest part about implementing that is deciding what snappy > > name to give the FOLL_ flag. > > Yes... FOLL_SOFT_COW ? FOLL_READONLY_COW ? > (plus a good comment explaining it I guess) I don't grok either of those - surely not READONLY_COW. FOLL_PREPARE is the closest I've got. I'd have said FOLL_TOUCH if that weren't already taken. FOLL_WRITE_IF_ABLE? > > >> I don't think we want to do the "force" semantics or deal with the > >> VM_MAYWRITE possiblity -- the access we give the hardware on > >> behalf of userspace should just match the access that userspace > >> actually has. It seems that if we don't try to get pages for writing > >> when VM_WRITE isn't set, we don't need force anymore. > > > > I suspect you never needed or wanted the weird force behaviour on > > shared maywrite, but that you did need the force COW behaviour on > > private currently-unwritable maywrite. You (or your forebears) > > defined that interface to use the force flag, I'm guessing it was > > for a reason; now you want to change it not to use the force flag, > > and it sounds good, but I'm afraid you'll discover down the line > > what the force flag was for. > > Actually I think I understand why the original code passed !write > as the force parameter. > > If the user is registering memory with read-only access, there are > two common cases. Possibly the underlying memory really has > a read-only mapping, but probably more often it is just an ordinary > buffer allocated in userspace with malloc() or the like. > > In the second case, it's quite likely we have a read/write mapping > of anonymous pages. We'll expose it read-only for RDMA but the > userspace process will write data into the memory via ordinary CPU > access. However, if we do ibv_reg_mr() before initializing the memory > it's quite possible that the mapping actually points to the zero page, > waiting for a CPU write to trigger a COW. > > So in the second case, doing GUP without the write flag will leave > the COW untriggered, and we'll end up mapping the zero page to > the hardware, and RDMA won't read the data that userspace actually > writes. So (without GUP extension as we're discussing in this thread) > we're forced to pass write==1 to GUP, even if we expect hardware > to only do reads. > > But if we pass write==1, then GUP on the first case (mapping that > is genuinely read-only) will fail, unless we pass force==1 too. But > this should only succeed if we're going to only access the memory > read-only, so we should set force to !writable-access-by-rdma. > > Which I think explains why the code is the way it is. But clearly > we could do better if we had a better way of telling GUP our real > intentions -- ie the FOLL_READONLY_COW flag. You've persuaded me. Yes, you have been using force because that was the only tool available at the time, to get close to the sensible behaviour you are now asking for. > > > Can you, for example, enforce the permissions set up by the user? > > I mean, if they do the ibv_reg_mr() on a private readonly area, > > so __get_user_pages with the FOLL_APPROPRIATELY flag will fault > > in ZERO_PAGEs, can you enforce that RDMA will never spray data > > into those pages? > > Yes, the access flags passed into ibv_reg_mr() are enforced by > the RDMA hardware, so if no write access is request, no write > access is possible. Okay, if you enforce the agreed permissions in hardware, that's fine. > > And presumably if we do GUP with write==1, force==0 that will > fail on a read-only mapping? That's right. Hugh ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-02-07 20:39 ` Hugh Dickins @ 2012-02-08 23:10 ` Hugh Dickins 2012-02-09 17:50 ` Roland Dreier 0 siblings, 1 reply; 17+ messages in thread From: Hugh Dickins @ 2012-02-08 23:10 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Tue, 7 Feb 2012, Hugh Dickins wrote: > On Mon, 6 Feb 2012, Roland Dreier wrote: > > Which I think explains why the code is the way it is. But clearly > > we could do better if we had a better way of telling GUP our real > > intentions -- ie the FOLL_READONLY_COW flag. > > You've persuaded me. Yes, you have been using force because that was > the only tool available at the time, to get close to the sensible > behaviour you are now asking for. > > > > > > Can you, for example, enforce the permissions set up by the user? > > > I mean, if they do the ibv_reg_mr() on a private readonly area, > > > so __get_user_pages with the FOLL_APPROPRIATELY flag will fault > > > in ZERO_PAGEs, can you enforce that RDMA will never spray data > > > into those pages? > > > > Yes, the access flags passed into ibv_reg_mr() are enforced by > > the RDMA hardware, so if no write access is request, no write > > access is possible. > > Okay, if you enforce the agreed permissions in hardware, that's fine. A doubt assaulted me overnight: sorry, I'm back to not understanding. What are these access flags passed into ibv_reg_mr() that are enforced? What relation do they bear to what you will pass to __get_user_pages()? You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag, which automatically works for read-write access to a VM_READ|VM_WRITE vma, but read-only access to a VM_READ-only vma, without you having to know which permission applies to which range of memory in the area specified. But you don't need that new flag to set up read-only access, and if you use that new flag to set up read-write access to an area which happens to contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs. ?Hugh? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-02-08 23:10 ` Hugh Dickins @ 2012-02-09 17:50 ` Roland Dreier 2012-02-09 22:57 ` Hugh Dickins 0 siblings, 1 reply; 17+ messages in thread From: Roland Dreier @ 2012-02-09 17:50 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Wed, Feb 8, 2012 at 3:10 PM, Hugh Dickins <hughd@google.com> wrote: > A doubt assaulted me overnight: sorry, I'm back to not understanding. > > What are these access flags passed into ibv_reg_mr() that are enforced? > What relation do they bear to what you will pass to __get_user_pages()? The access flags are: enum ibv_access_flags { IBV_ACCESS_LOCAL_WRITE = 1, IBV_ACCESS_REMOTE_WRITE = (1<<1), IBV_ACCESS_REMOTE_READ = (1<<2), IBV_ACCESS_REMOTE_ATOMIC = (1<<3), IBV_ACCESS_MW_BIND = (1<<4) }; pretty much the only one of interest is IBV_ACCESS_REMOTE_READ -- all the others imply the possibility of RDMA HW writing to the page. So basically if any flags other than IBV_ACCESS_REMOTE_READ are set, we pass FOLL_WRITE to __get_user_pages(), otherwise we pass the new FOLL_FOLLOW. [does "Marcia, Marcia, Marcia" mean anything to a Brit? ;)] ie the change from the status quo would be: [read-only] write=1, force=1 --> FOLL_FOLLOW [writeable] wrote=1, force=0 --> FOLL_WRITE (equivalent) > You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag, > which automatically works for read-write access to a VM_READ|VM_WRITE vma, > but read-only access to a VM_READ-only vma, without you having to know > which permission applies to which range of memory in the area specified. > But you don't need that new flag to set up read-only access, and if you > use that new flag to set up read-write access to an area which happens to > contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs. First of all, I kind of like FOLL_FOLLOW as the name :) Now you're confusing me: I think we do need FOLL_FOLLOW to set up read-only access -- we want to trigger the COWs that userspace might trigger by touching the memory up front. This is to handle a case like [userspace] int *buf = malloc(16 * 4096); // buf now points to 16 anonymous zero_pages mr = ibv_reg_mr(pd, buf, 16 * 4096, IBV_ACCESS_REMOTE_READ); // RDMA HW will only ever read buf, but... buf[0] = 2012; // COW triggered, first page of buf changed, RDMA HW has wrong mapping! For something the RDMA HW might write to, then I agree we don't want FOLL_FOLLOW -- we just would use FOLL_WRITE as we currently do. When I get around to coding this up, I think I'm going to spend a lot of time on the comments and on the commit log :) - R. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters 2012-02-09 17:50 ` Roland Dreier @ 2012-02-09 22:57 ` Hugh Dickins 0 siblings, 0 replies; 17+ messages in thread From: Hugh Dickins @ 2012-02-09 22:57 UTC (permalink / raw) To: Roland Dreier; +Cc: linux-rdma, Andrea Arcangeli, linux-mm, linux-kernel On Thu, 9 Feb 2012, Roland Dreier wrote: > On Wed, Feb 8, 2012 at 3:10 PM, Hugh Dickins <hughd@google.com> wrote: > > A doubt assaulted me overnight: sorry, I'm back to not understanding. > > > > What are these access flags passed into ibv_reg_mr() that are enforced? > > What relation do they bear to what you will pass to __get_user_pages()? > > The access flags are: > > enum ibv_access_flags { > IBV_ACCESS_LOCAL_WRITE = 1, > IBV_ACCESS_REMOTE_WRITE = (1<<1), > IBV_ACCESS_REMOTE_READ = (1<<2), > IBV_ACCESS_REMOTE_ATOMIC = (1<<3), > IBV_ACCESS_MW_BIND = (1<<4) > }; > > pretty much the only one of interest is IBV_ACCESS_REMOTE_READ -- > all the others imply the possibility of RDMA HW writing to the page. > > So basically if any flags other than IBV_ACCESS_REMOTE_READ are > set, we pass FOLL_WRITE to __get_user_pages(), otherwise we pass > the new FOLL_FOLLOW. [does "Marcia, Marcia, Marcia" mean anything > to a Brit? ;)] [ Nothing whatsoever - I needed to avoid saying "Zilch" there, didn't I? - I had to look her up. Not sure quite how she comes in here, if you're implying that someone is perfect, I rather doubt you're thinking of me! I was thrilled a year ago at last to discover who Virginia is, celebrated in mm/memory.c and mm/page-writeback.c. ] > > ie the change from the status quo would be: > > [read-only] write=1, force=1 --> FOLL_FOLLOW > [writeable] wrote=1, force=0 --> FOLL_WRITE (equivalent) > > > You are asking for a FOLL_FOLLOW ("follow permissions of the vma") flag, > > which automatically works for read-write access to a VM_READ|VM_WRITE vma, > > but read-only access to a VM_READ-only vma, without you having to know > > which permission applies to which range of memory in the area specified. > > > But you don't need that new flag to set up read-only access, and if you > > use that new flag to set up read-write access to an area which happens to > > contain VM_READ-only ranges, you have set it up to write into ZERO_PAGEs. > > First of all, I kind of like FOLL_FOLLOW as the name :) Yeah, it's not too bad; though below I'm now wondering if it is appropriate. > > Now you're confusing me: I'm very glad to hear it, I feel less alone. > I think we do need FOLL_FOLLOW to > set up read-only access -- we want to trigger the COWs that userspace > might trigger by touching the memory up front. This is to handle > a case like > > [userspace] > int *buf = malloc(16 * 4096); > // buf now points to 16 anonymous zero_pages > mr = ibv_reg_mr(pd, buf, 16 * 4096, IBV_ACCESS_REMOTE_READ); > // RDMA HW will only ever read buf, but... > buf[0] = 2012; > // COW triggered, first page of buf changed, RDMA HW has wrong mapping! > > For something the RDMA HW might write to, then I agree we don't want > FOLL_FOLLOW -- we just would use FOLL_WRITE as we currently do. Ah, okay, something earlier in the thread had thrown me off that track, I thought we were expecting the ibv_reg_mr to give the remote the same permissions as the user had. Or something, maybe I'm just making excuses for being dense. But then I wonder if FOLL_FOLLOW is actually the behaviour you need. Imagine a PROT_READ MAP_PRIVATE area (just as in your original mail): what if the user does mprotect PROT_READ|PROT_WRITE on that afterwards, and then proceeds to touch it. The old write=1 force=1 GUP would have pre-COWed that and no problem, but FOLL_FOLLOW will not. Maybe you can answer "don't do that"; but you do then appear to be trading one kind of "don't do that" for another. Maybe it depends on what libraries might get up to: aren't there (debug? garbage collection?) memalloc libraries which give out memory protected until you touch it? Maybe you need FOLL_PRECOW, which does write=1 force=1 on the private areas, but just faults in the shared areas (avoiding the bizarre forced COW on shared areas). > > When I get around to coding this up, I think I'm going to spend a lot > of time on the comments and on the commit log :) I am sorry to be driving you to such effort, honestly. Hugh ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2012-02-09 22:57 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-01-26 5:59 [PATCH/RFC G-U-P experts] IB/umem: Modernize our get_user_pages() parameters Roland Dreier 2012-01-26 20:01 ` Hugh Dickins 2012-01-26 22:45 ` Roland Dreier 2012-01-27 17:28 ` Roland Dreier 2012-01-28 2:31 ` Hugh Dickins 2012-01-28 19:25 ` Jason Gunthorpe 2012-01-30 19:19 ` Roland Dreier 2012-01-28 2:19 ` Hugh Dickins 2012-01-30 19:16 ` Roland Dreier 2012-01-30 20:20 ` Andrea Arcangeli 2012-02-06 17:46 ` Roland Dreier 2012-01-30 20:34 ` Hugh Dickins 2012-02-06 17:39 ` Roland Dreier 2012-02-07 20:39 ` Hugh Dickins 2012-02-08 23:10 ` Hugh Dickins 2012-02-09 17:50 ` Roland Dreier 2012-02-09 22:57 ` Hugh Dickins
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).