* get_user_pages question
@ 2009-11-09  6:50 Mark Veltzer
  2009-11-09  9:31 ` Andi Kleen
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Veltzer @ 2009-11-09  6:50 UTC (permalink / raw)
  To: linux-kernel

Hello all!

I have searched the list for similar issues and have not found an answer so I 
am posting.

I am using 'get_user_pages' and friends to get hold of user memory in kernel 
space. User space passes a buffer to the kernel, the kernel does 
get_user_pages and holds the pages for some time while user space is doing 
something else, then writes to the pages and releases them (SetPageDirty and 
page_cache_release, as per LDD 3rd edition). So far so good. 
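
Roughly, the sequence looks like this (a simplified sketch; user_ptr, 
nr_pages, MAX_PAGES and my_pages are placeholders for my driver's own state, 
and error handling is dropped):

	struct page *my_pages[MAX_PAGES];
	int i, got;

	down_read(&current->mm->mmap_sem);
	got = get_user_pages(current, current->mm,
			     (unsigned long)user_ptr & PAGE_MASK, nr_pages,
			     1 /* write */, 0 /* force */, my_pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* ... later, after the driver has written into the pages ... */

	for (i = 0; i < got; i++) {
		if (!PageReserved(my_pages[i]))
			SetPageDirty(my_pages[i]);
		page_cache_release(my_pages[i]);
	}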

I am testing this kernel module with several buffers from user space 
allocated in several different ways: heap, data segment, static variable in a 
function, and stack. All scenarios work EXCEPT the stack one. When passing 
the stack buffer the kernel sees one thing while user space sees another.

My not-so-intelligent questions (they may well be off the mark):
- How can this be? (two views of the same page)
- Does 'get_user_pages' not pin the pages?
- Could this be due to stack protection of some sort?
- Do I need to do anything extra with the vm_area I receive for the stack 
pages BESIDES 'get_user_pages'?

I know this is not an orthodox way to write a driver and that I would be 
better off using mmap for these things, but I have other constraints in this 
driver design that I do not want to bore you with. I am also aware that 
passing a buffer on the stack and letting user space continue running is a 
very dangerous thing to do for user space (or kernel space) integrity. I wish 
I could do it another way...

The platform is standard 32-bit x86 with the stock kernels and headers 
distributed with Ubuntu 9.04 and 9.10, which are 2.6.28 and 2.6.31.

Please reply to my email as well, as I am not a subscriber.

Cheers,
	Mark

* Re: get_user_pages question
  2009-11-09  6:50 get_user_pages question Mark Veltzer
@ 2009-11-09  9:31 ` Andi Kleen
  2009-11-09 10:32   ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Andi Kleen @ 2009-11-09  9:31 UTC (permalink / raw)
  To: Mark Veltzer; +Cc: linux-kernel

Mark Veltzer <mark.veltzer@gmail.com> writes:
>
> I am testing this kernel module with several buffers from user space allocated 
> in several different ways. heap, data segment, static variable in function and 
> stack. All scenarious work EXCEPT the stack one. When passing the stack buffer 
> the kernel sees one thing while user space sees another.

In theory it should work; the stack is no different from any other pages.
My first thought was that you were using some platform with incoherent
caches, but that doesn't seem to be the case if it's standard x86.

> My not so intelligent questions (they may well be off the mark):
> - How can this be? (two views of the same page)

It should not be on a coherent platform.

> - Does not 'get_user_pages' pin the pages?

Yes it does.

> - Could this be due to stack protection of some sort?

No.

> - Do I need to do anything extra with the vm_area I receive for the stack 
> pages EXCEPT 'get_user_pages' ?

No. Stack is like any other user memory.

Most likely it's some bug in your code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

* Re: get_user_pages question
  2009-11-09  9:31 ` Andi Kleen
@ 2009-11-09 10:32   ` Hugh Dickins
  2009-11-09 22:13     ` Mark Veltzer
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2009-11-09 10:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mark Veltzer, linux-kernel

On Mon, 9 Nov 2009, Andi Kleen wrote:
> Mark Veltzer <mark.veltzer@gmail.com> writes:
> >
> > I am testing this kernel module with several buffers from user space allocated 
> > in several different ways. heap, data segment, static variable in function and 
> > stack. All scenarious work EXCEPT the stack one. When passing the stack buffer 
> > the kernel sees one thing while user space sees another.
> 
> In theory it should work, stack is no different from any other pages.
> First thought was that you used some platform with incoherent caches,
> but that doesn't seem to be the case if it's standard x86.

It may be irrelevant to Mark's stack case, but it is worth mentioning the
fork problem: a process does get_user_pages to pin down a buffer somewhere
in anonymous memory; a thread forks (write-protecting the anonymous memory
shared between parent and child); child userspace then writes to a location
in the same page as that buffer, causing a copy-on-write which breaks the
connection between the get_user_pages buffer and what child userspace sees
there afterwards.
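
A minimal user-space illustration of that sequence ("/dev/mydrv" and the
PIN_BUFFER ioctl are hypothetical stand-ins for whatever driver call does
the get_user_pages):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>

	#define PIN_BUFFER 0x4d01	/* hypothetical ioctl number */

	int main(void)
	{
		static char buf[64];	/* anonymous memory: could equally be heap or stack */
		int fd = open("/dev/mydrv", O_RDWR);

		ioctl(fd, PIN_BUFFER, buf);	/* driver pins buf's page with get_user_pages */

		if (fork() == 0) {	/* fork write-protects the shared anonymous pages */
			buf[0] = 1;	/* child's write faults: COW gives the child a fresh */
			_exit(0);	/* copy, no longer the page the driver pinned */
		}
		return 0;
	}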

Hugh

* Re: get_user_pages question
  2009-11-09 10:32   ` Hugh Dickins
@ 2009-11-09 22:13     ` Mark Veltzer
  2009-11-10 16:33       ` Hugh Dickins
  0 siblings, 1 reply; 13+ messages in thread
From: Mark Veltzer @ 2009-11-09 22:13 UTC (permalink / raw)
  To: Hugh Dickins, linux-kernel, Andi Kleen

On Monday 09 November 2009 12:32:52 you wrote:
> On Mon, 9 Nov 2009, Andi Kleen wrote:
> > Mark Veltzer <mark.veltzer@gmail.com> writes:
> > > I am testing this kernel module with several buffers from user space
> > > allocated in several different ways. heap, data segment, static
> > > variable in function and stack. All scenarious work EXCEPT the stack
> > > one. When passing the stack buffer the kernel sees one thing while user
> > > space sees another.
> >
> > In theory it should work, stack is no different from any other pages.
> > First thought was that you used some platform with incoherent caches,
> > but that doesn't seem to be the case if it's standard x86.
> 
> It may be irrelevant to Mark's stack case, but it is worth mentioning
> the fork problem: how a process does get_user_pages to pin down a buffer
> somewhere in anonymous memory, a thread forks (write protecting anonymous
> memory shared between parent and child), child userspace writes to a
> location in the same page as that buffer, causing copy-on-write which
> breaks the connection between the get_user_pages buffer and what child
> userspace sees there afterwards.
> 
> Hugh
> 

Thanks Hugh and Andi

Hugh, you actually hit the nail on the head!

I was forking while doing these mappings, and the child won the race and got 
to keep the pinned pages while the parent got left with a copy which meant 
nothing. The thing is that it was hard to spot, because I was using a library 
function which called a function etc... which eventually did some system(3). 
It only happened in the stack testing case because the child was not really 
doing anything with the pinned memory on purpose, and so in all other cases 
it did not touch the memory - except the stack, which it of course uses. The 
child won the race in the stack case and so shared the data with the kernel, 
and the parent got a copy with the old data.

I understand that madvise(2) can prevent this copy-on-write race between 
child and parent, and I also duplicated it in the kernel using the following 
code:

		down_write(&current->mm->mmap_sem);
		vma = find_vma(current->mm, (unsigned long)[user pointer]);
		if (vma)
			vma->vm_flags |= VM_DONTCOPY;
		up_write(&current->mm->mmap_sem);

The above code is effectively a kernel-side version of madvise(2) with 
MADV_DONTFORK.
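
In user space, the equivalent is roughly this (a sketch; buf/len is the 
buffer handed to the driver, and the range is rounded out to whole pages 
since madvise(2) only operates on page granularity):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>

	static int dontfork_buffer(void *buf, size_t len)
	{
		long psz = sysconf(_SC_PAGESIZE);
		unsigned long start = (unsigned long)buf & ~(psz - 1);
		unsigned long end   = ((unsigned long)buf + len + psz - 1) & ~(psz - 1);

		return madvise((void *)start, end - start, MADV_DONTFORK);
	}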

The problem with this solution (either madvise in user space or VM_DONTCOPY 
in the kernel) is that I give up the ability to fork(2), since the child is 
left stackless (or with a hole in its stack - I'm not sure...).

My question is: is there a way to allow forking while still pinning STACK 
memory via get_user_pages? I can actually live with the current solution, 
since I can make sure that the user space thread that does the work with the 
driver never forks, but I'm interested to know what other neat vm tricks 
Linux has up its sleeve...

BTW: would it not be a good addition to the madvise(2) manpage to state that 
you should be careful with madvise(MADV_DONTFORK), because you may segfault 
your children, and that doing so on a stack address has an even greater 
chance of crashing children? Whom should I talk to about adding this info to 
the manual page? The current manpage that I have only talks about 
scatter-gather uses of DONTFORK and does not mention its problems...

Thanks in advance
	Mark

* Re: get_user_pages question
  2009-11-09 22:13     ` Mark Veltzer
@ 2009-11-10 16:33       ` Hugh Dickins
  2009-11-28 18:50         ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2009-11-10 16:33 UTC (permalink / raw)
  To: Mark Veltzer
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, KOSAKI Motohiro,
	Michael Kerrisk

On Tue, 10 Nov 2009, Mark Veltzer wrote:
> On Monday 09 November 2009 12:32:52 you wrote:
> > On Mon, 9 Nov 2009, Andi Kleen wrote:
> > > Mark Veltzer <mark.veltzer@gmail.com> writes:
> > > > I am testing this kernel module with several buffers from user space
> > > > allocated in several different ways. heap, data segment, static
> > > > variable in function and stack. All scenarious work EXCEPT the stack
> > > > one. When passing the stack buffer the kernel sees one thing while user
> > > > space sees another.
> > >
> > > In theory it should work, stack is no different from any other pages.
> > > First thought was that you used some platform with incoherent caches,
> > > but that doesn't seem to be the case if it's standard x86.
> > 
> > It may be irrelevant to Mark's stack case, but it is worth mentioning
> > the fork problem: how a process does get_user_pages to pin down a buffer
> > somewhere in anonymous memory, a thread forks (write protecting anonymous
> > memory shared between parent and child), child userspace writes to a
> > location in the same page as that buffer, causing copy-on-write which
> > breaks the connection between the get_user_pages buffer and what child
> > userspace sees there afterwards.
> 
> Thanks Hugh and Andi
> 
> Hugh, you actually hit the nail on the head!

I'm glad that turned out to be relevant and helpful.

> 
> I was forking while doing these mappings and the child won the race and got to 
> keep the pinned pages while the parent got left with a copy which meant 
> nothing. The thing is that it was hard to spot because I was using a library 
> function which called a function etc... which eventually did some system(3). 
> It only happened on in stack testing case bacause the child was not really 
> doing anything with the pinned memory on purpose and so in all other cases did 
> not touch the memory except the stack which it, ofcourse, uses. The child won 
> the race in the stack case and so shared the data with the kernel and the 
> parent got a copy with the old data.
> 
> I understand that madvise(2) can prevent this copy-on-write and race between 
> child and parent and I also duplicated it in the kernel using the following 
> code:
> 
> 		[lock the current->mm for writing]
> 			vma=find_vma(current->mm, [user pointer])
> 			vma->vm_flags|=VM_DONTCOPY
> 		[unlock the current->mm for writing]
> 
> The above code is actually a kernel version of madvise(2) and MADV_DONTFORK.
> 
> The problem with this solution (either madvise in user space or DONTCOPY in 
> kernel) is that I give up the ability to fork(2) since the child is left 
> stackless (or with a hold in it's stack - im not sure...)
> 
> My question is: is there a way to allow forking while still pinning STACK 
> memory via get_user_pages? I can actually live with the current solution since 
> I can make sure that the user space thread that does the work with the driver 
> never forks but I'm interested to know what other neat vm tricks linux has up 
> it's sleeve...

I think MADV_DONTFORK is as far as we've gone,
but I might be forgetting something.

In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
they are two people keen to fix this issue once and for all.  Whereas
I am with Linus in the opposite camp: solutions have looked nasty,
and short of bright new ideas, I feel we've gone as far as we ought.

Just don't do that: don't test the incompatibility of GUP pinning
versus COW semantics, by placing such buffers in problematic areas
while forking.

(That sentence might be more convincing if we put in more thought,
to enumerate precisely which areas are "problematic".)

> 
> BTW: would it not be a good addition to the madvise(2) manpage to state that 
> you should be careful with doing madvise(DONTFORK) because you may segfault 
> your children and that doing so on a stack address has even more chance of 
> crashing children ? Who should I talk about adding this info to the manual 
> page? The current manpage that I have only talks about scatter-gather uses of 
> DONTFORK and does not mention the problems of DONTFORK...

Michael looks after the manpages, I've added him to the Cc.
Yes, an additional sentence there might indeed be helpful.

Hugh

* Re: get_user_pages question
  2009-11-10 16:33       ` Hugh Dickins
@ 2009-11-28 18:50         ` Andrea Arcangeli
  2009-11-28 22:22           ` Mark Veltzer
  2009-11-30 11:54           ` Nick Piggin
  0 siblings, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2009-11-28 18:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mark Veltzer, linux-kernel, Andi Kleen, KOSAKI Motohiro,
	Michael Kerrisk, Nick Piggin

Hi Hugh and everyone,

On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> they are two people keen to fix this issue once and for all.  Whereas

Right, I'm sure Nick also wants to fix this once and for all (adding
him too to Cc ;).

I thought, and still think, it's bad to leave races like this open for
people to find out the hard way. It just takes somebody using pthread_create,
opening a file with O_DIRECT with a 512-byte (not page) aligned buffer, and
calling fork to trigger this, and they may find out only later, after going
into production on thousands of servers... If this were too hard a problem
to fix I would understand, but I have all the patches ready to fix it
completely! And they're quite localized: they only touch fork and gup, and
they don't alter the fast path (except for one conditional jump in fork,
which is surely lost in the noise - and fork is hardly a fast path anyway).
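
To make the trigger concrete, something along these lines is all it takes (a
sketch, timing dependent of course; "datafile" and the sizes are made up -
the point is a gup-pinned 512-byte buffer sharing its page with memory that
the parent keeps writing to across the fork):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <unistd.h>

	static char *page;	/* first 512 bytes used for O_DIRECT, rest is ordinary data */

	static void *reader(void *arg)
	{
		int fd = open("datafile", O_RDONLY | O_DIRECT);
		pread(fd, page, 512, 0);	/* gup pins the whole page for the DMA */
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		posix_memalign((void **)&page, 4096, 4096);
		pthread_create(&t, NULL, reader, NULL);

		page[1024] = 1;		/* parent uses another part of the same page */
		if (fork() == 0)	/* fork while the O_DIRECT read may be in flight */
			_exit(0);
		page[1024] = 2;		/* the parent's post-fork write COWs the page, so the */
					/* data read from disk can land in a page it no longer sees */
		pthread_join(t, NULL);
		return 0;
	}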

I tried to fix this in RHEL, but eventually the affected user added larger
alignment to the userland app to prevent it, so it isn't as urgent anymore,
and I'd rather fix this in mainline first. This isn't the first user to be
bitten by this, and it surely won't be the last, unless we take action.

> I am with Linus in the opposite camp: solutions have looked nasty,
> and short of bright new ideas, I feel we've gone as far as we ought.

There are two gup races that materialize when we wrprotect and share an
anonymous page.

bug 1) If a parent thread writes to the first half of the page while the gup
user writes to the second half of the page, and then fork is run, the
O_DIRECT read from disk into the second half of the page gets lost. In
addition the child will still receive the O_DIRECT writes to memory when it
should not.

bug 2) The backward race happens after fork, when the parent starts an
O_DIRECT write to disk from the first half of the page and then writes to
memory in the second half of the page; after that, the child's writes to the
page will be read by the parent's direct-io.

The fix for bug 1) is what Nick and I implemented, which consists of copying
(instead of sharing) anon pages during fork if they could be under gup. The
two implementations are vastly different but they appear to do the same thing
(he used bitflags in the vma and in the page, I only used a bitflag in the
page; the worst part of my patch was having to set that bitflag in gup_fast
too, and I don't like having to add a bit to the vma when a bit in the page
is enough).

Fix A for bug 2) is what KOSAKI tried to implement in message-id
20090414151554.C64A.A69D9226. The trick is to have do_wp_page not take over a
page under GUP (which means reuse_swap_cache has to take the page_count into
account too, not just the mapcount). However, taking page_count into account
in reuse_swap_cache means it won't be capable of taking over a page under gup
that got temporarily converted to swapcache and unmapped, thus losing
O_DIRECT reads from disk during paging. So another change is required in the
rmap code to prevent ever unmapping any pinned anon page that could be under
GUP, to avoid losing I/O during paging.

Fix B for bug 2) is what Nick and I implemented, which consists of always
de-cowing shared anon pages during gup, even in the gup(write=0) case. That's
much simpler than fix A for bug 2, and it doesn't affect rmap swap semantics,
but it loses some sharing capability in the gup(write=0) cases - not a
practical matter though.

All other patches floating around spread an mm-wide semaphore over the fork
fast path and across O_DIRECT, nfs and aio; they most certainly didn't fix
the two races for all gup users, and they weren't stable because of having to
identify the closure of the I/O across all possible put_page calls. That
approach kind of opens a can of worms and looks like the wrong way to go to
me, and I think it scales worse for the fast path too (no O_DIRECT or no
fork). Identifying the gup closure points and replacing the raw put_page with
a gup_put_page would not be a useless effort though, and I felt that if the
gup API was just a little more sophisticated I could simplify
put_compound_page a bit to serialize the race against
split_huge_page_refcount - but that is an issue orthogonal to the mm-wide
semaphore release addition, which I personally dislike.

* Re: get_user_pages question
  2009-11-28 18:50         ` Andrea Arcangeli
@ 2009-11-28 22:22           ` Mark Veltzer
  2009-11-30 12:01             ` Nick Piggin
  2009-11-30 11:54           ` Nick Piggin
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Veltzer @ 2009-11-28 22:22 UTC (permalink / raw)
  To: Andrea Arcangeli, linux-kernel
  Cc: Hugh Dickins, Andi Kleen, KOSAKI Motohiro, Michael Kerrisk, Nick Piggin

On Saturday 28 November 2009 20:50:52 you wrote:
> Hi Hugh and everyone,
> 
> On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> > In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> > they are two people keen to fix this issue once and for all.  Whereas
> 
> Right, I'm sure Nick also wants to fix this once and for all (adding
> him too to Cc ;).
> 
> I thought and I still think it's bad to leave races like this open for
> people to find out the hard way. It just takes somebody to use
> pthread_create, open a file with O_DIRECT with 512byte (not page
> ....

Hello all!

First let me state that I solved my problems by simply avoiding GUP 
completely and going with a clean mmap implementation (with the fault/nopage 
handler), which causes no problems whatsoever. mmap does not suffer from any 
of the problems discussed above (aside from the fact that you have to do your 
own bookkeeping as far as vma_open, vma_close and the fault function go...). 
Please correct me if I'm wrong... :)
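
For completeness, the shape of what I ended up with (a sketch, not my actual 
driver; my_dev is hypothetical, the buffer is assumed to come from 
vmalloc_user(), and the mapping is assumed to start at file offset 0):

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	struct my_dev {			/* hypothetical driver state */
		void *buf;		/* allocated with vmalloc_user() */
		size_t buf_size;
	};

	static void my_vma_open(struct vm_area_struct *vma)  { /* take a reference on my_dev */ }
	static void my_vma_close(struct vm_area_struct *vma) { /* drop the reference */ }

	static int my_vma_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct my_dev *dev = vma->vm_private_data;
		unsigned long off = vmf->pgoff << PAGE_SHIFT;
		struct page *page;

		if (off >= dev->buf_size)
			return VM_FAULT_SIGBUS;
		page = vmalloc_to_page(dev->buf + off);
		get_page(page);		/* hand out a referenced page */
		vmf->page = page;
		return 0;
	}

	static struct vm_operations_struct my_vm_ops = {
		.open  = my_vma_open,
		.close = my_vma_close,
		.fault = my_vma_fault,
	};

	static int my_mmap(struct file *filp, struct vm_area_struct *vma)
	{
		vma->vm_ops = &my_vm_ops;
		vma->vm_private_data = filp->private_data;
		my_vma_open(vma);
		return 0;
	}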

The fact that I solved all my problems with mmap, together with the 
complexity of the proposed solutions, got me thinking about GUP in more 
general terms. Would it be fair to say that mmap is much more aligned with 
the kernel's way of doing things than GUP? It feels like the vma concept, 
which is a solid one and probably works well for most architectures, is in 
conflict with GUP, and so is fork. It also feels like the vma concept is well 
aligned with fork, which means in turn that mmap is well aligned with fork 
while GUP is not. This is a new conclusion for me, and one which did not 
register back when I read the LDD book (I got the impression that you can 
pick between mmap and GUP and it does not really matter, but now I feel that 
mmap is far more advanced and trouble-free).

To test this out I grepped the drivers folder of a recent kernel and found 
ONLY 26 mentions of GUP in the entire folder! The main drivers using GUP are 
scsi st and infiniband. If GUP is so little used, is it really essential as 
an in-kernel interface? If GUP were slowly dropped and drivers converted to 
mmap, would it not simplify kernel code, or at least prevent complex 
solutions to GUP problems from complicating mm code even more? Again, I'm no 
kernel expert, so please don't flame me too hard if I'm talking heresy or 
nonsense; I would just like to hear your take on this. It may well be that I 
simply have no clue and my conclusions are way too radical...

Cheers,
	Mark

* Re: get_user_pages question
  2009-11-28 18:50         ` Andrea Arcangeli
  2009-11-28 22:22           ` Mark Veltzer
@ 2009-11-30 11:54           ` Nick Piggin
  1 sibling, 0 replies; 13+ messages in thread
From: Nick Piggin @ 2009-11-30 11:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Mark Veltzer, linux-kernel, Andi Kleen,
	KOSAKI Motohiro, Michael Kerrisk

On Sat, Nov 28, 2009 at 07:50:52PM +0100, Andrea Arcangeli wrote:
> All other patches floating around spread an mm-wide semaphore over
> fork fast path, and across O_DIRECT, nfs, and aio, and they most
> certainly didn't fix the two races for all gup users, and they weren't
> stable because of having to identify the closure of the I/O across all
> possible put_page. That approach kind of opens a can of worms and it
> looks the wrong way to go to me, and I think they scale worse too for
> the fast path (no O_DIRECT or no fork). Identifying the gup closure
> points and replacing the raw put_page with gup_put_page would not be
> an useless effort though and I felt if the gup API was just a little
> bit more sophisticated I could simplify a bit the put_compound_page to
> serialize the race against split_huge_page_refcount, but this is an
> orthogonal issue with the mm-wide semaphore release addition which I
> personally dislike.

IIRC, the last time this came up, it kind of became stalled on this
point. Linus hated our "preemptive cow" approaches, and thought the
above approach was better.

I don't think we need to bother arguing details between our former
approaches until we get past this sticking point.

FWIW, I need to change get_user_pages semantics somewhat because we
have filesystems that cannot tolerate a set_page_dirty() to dirty a
clean page (it must only be dirtied with page_mkwrite).

This should probably require converting callers to use put_user_pages
and disallowing lock_page, mmap_sem, user-copy etc. within these
sections.


* Re: get_user_pages question
  2009-11-28 22:22           ` Mark Veltzer
@ 2009-11-30 12:01             ` Nick Piggin
  2009-11-30 16:12               ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2009-11-30 12:01 UTC (permalink / raw)
  To: Mark Veltzer
  Cc: Andrea Arcangeli, linux-kernel, Hugh Dickins, Andi Kleen,
	KOSAKI Motohiro, Michael Kerrisk

On Sun, Nov 29, 2009 at 12:22:17AM +0200, Mark Veltzer wrote:
> On Saturday 28 November 2009 20:50:52 you wrote:
> > Hi Hugh and everyone,
> > 
> > On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> > > In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> > > they are two people keen to fix this issue once and for all.  Whereas
> > 
> > Right, I'm sure Nick also wants to fix this once and for all (adding
> > him too to Cc ;).
> > 
> > I thought and I still think it's bad to leave races like this open for
> > people to find out the hard way. It just takes somebody to use
> > pthread_create, open a file with O_DIRECT with 512byte (not page
> > ....
> 
> Hello all!
> 
> First let me state that I solved my problems by simply avoiding GUP completely 
> and going with a clean mmap implemenation (with the nopage version) which 
> causes no problems what so ever. mmap does not suffer from all the problems 
> discussed above (aside from the fact that you have to do your own book keeping 
> as far as vma_open and vma_close and fault function goes...). Please correct 
> me if I'm wrong...:)
> 
> The fact that I solved all my problems with mmap and the complexity of the 
> proposed solutions got me thinking about GUP in more general terms. Would it 
> be fair to say that mmap is much more aligned to the kernels way of doing 
> things than GUP? It feels like the vma concept which is a solid one and 
> probably works well for most architectures is in conflict with GUP and so is 
> fork. It also feels like the vma concept is well aligned with fork which means 
> in turn that mmap is well aligned with fork while GUP is not. This is a new 
> conclusion for me and one which did not register back when reading the LDD 
> book (I got the impression that you can pick between mmap and GUP and it does 
> not really matter but now I feel that mmap is much advanced and trouble free).
> 
> Testing it out I grepped the drivers folder of a recent kernel with ONLY 26 
> mentions of GUP in the entire drivers folder! The main drivers using GUP are 
> scsi st and infiniband. If GUP is so unused is it really essential as an in 
> kernel interface? If GUP is slowly dropped and drivers converted to mmap would 
> it not simplify kernel code or at least prevent complex solutions to GUP 
> problems from complicating mm code even more? Again, I'm no kernel expert so 
> please don't flame me too hard if I'm talking heresy or nonsense, I would just 
> like to hear your take on this. It may well be that I simply have no clue and 
> so my conclusions are way too radical...

GUP is basically required to do any kind of I/O operation on user addresses
(those not owned by your driver, i.e. arbitrary addresses) without first
copying the memory into the kernel.

If you can wean O_DIRECT off get_user_pages, you'd have most of the battle
won. I don't think it's really possible though.


* Re: get_user_pages question
  2009-11-30 12:01             ` Nick Piggin
@ 2009-11-30 16:12               ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2009-11-30 16:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Mark Veltzer, linux-kernel, Hugh Dickins, Andi Kleen,
	KOSAKI Motohiro, Michael Kerrisk

On Mon, Nov 30, 2009 at 01:01:45PM +0100, Nick Piggin wrote:
> If you can wean O_DIRECT off get_user_pages, you'd have most of the
> battle won. I don't think it's really possible though.

Agreed. Not just O_DIRECT: virtualization requires it too - the kvm page
fault calls get_user_pages, and practically anything that uses the mmu
notifier also uses get_user_pages. There are things you simply can't do
without it.

In general, if the memory doesn't need to be persistently stored on disk to
survive task killage, there's not much point in using on-disk MAP_SHARED
pagecache instead of anonymous memory; this is why anonymous memory backs
malloc. And there's no reason people should be prevented from issuing
zero-copy disk I/O on anonymous memory (or tmpfs), if they know they access
the data only once and want to manage the cache in some logical form rather
than in the physical on-disk format (or if duplicate physical caches are more
efficiently kept elsewhere, as in the KVM guest case).

OTOH, if you are using the I/O data in its physical format in your userland
memory, then using the pagecache by mmapping the file and disabling O_DIRECT
on the filesystem is surely preferred and more efficient (if nothing else,
because it also provides caching just in case).

For drivers (Mark's case) it depends, but if you can avoid using
get_user_pages without slowing anything down, you should; that usually makes
the code simpler... and it won't risk suffering from these race conditions
either ;).

* Re: get_user_pages question
  2004-05-01 11:32 ` Arjan van de Ven
@ 2004-05-01 11:41   ` Eli Cohen
  0 siblings, 0 replies; 13+ messages in thread
From: Eli Cohen @ 2004-05-01 11:41 UTC (permalink / raw)
  To: arjanv; +Cc: linux-kernel

Arjan van de Ven wrote:

>>I used 2.4.21-4 (RH AS 3.0).
>>    
>>
>
>I still have grave doubts about what you try to do... is this the code
>at openib.org ? or is there some other URL where the code is visible ?
>
>  
>
It's not in openib, because I am not using openib - it does not do what I 
need. The code in openib calls sys_mlock to lock the pages in the process's 
address space. In my latest description I meant that the two calls to 
get_user_pages were made on the same buffer.

* Re: get_user_pages question
  2004-05-01 11:12 Eli Cohen
@ 2004-05-01 11:32 ` Arjan van de Ven
  2004-05-01 11:41   ` Eli Cohen
  0 siblings, 1 reply; 13+ messages in thread
From: Arjan van de Ven @ 2004-05-01 11:32 UTC (permalink / raw)
  To: Eli Cohen; +Cc: linux-kernel


> Apparently some pages were discarded and the subsequent page fault 
> brought a new page. I expected the original page to be in the swap cache 
> and get the old page again. I repeated the experiment but before the 
> first ioctl I wrote something to all the pages but got the same results. 
> I used 2.4.21-4 (RH AS 3.0).

I still have grave doubts about what you are trying to do... is this the
code at openib.org, or is there some other URL where the code is visible?



* get_user_pages question
@ 2004-05-01 11:12 Eli Cohen
  2004-05-01 11:32 ` Arjan van de Ven
  0 siblings, 1 reply; 13+ messages in thread
From: Eli Cohen @ 2004-05-01 11:12 UTC (permalink / raw)
  To: linux-kernel

Hi,
I have been trying to use get_user_pages() on malloced memory to get the 
list of pages, but it does not work as I expected (the kernel side is 
sketched below):
1. malloc a buffer in user space.
2. issue an ioctl, invoke get_user_pages() and save the page descriptors 
obtained.
3. at a later time, issue another ioctl, invoke get_user_pages() again and 
save another copy of the page descriptors.
4. Compare the two lists of page descriptors. They are not all the same.
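
Schematically, the ioctl side does something like this (a sketch; ubuf, 
npages and the two page arrays are placeholders, and error handling is 
dropped):

	/* first ioctl: pin the buffer and remember the page descriptors */
	down_read(&current->mm->mmap_sem);
	get_user_pages(current, current->mm, (unsigned long)ubuf & PAGE_MASK,
		       npages, 1 /* write */, 0 /* force */, saved_pages, NULL);
	up_read(&current->mm->mmap_sem);

	/* second ioctl, later: pin again and compare */
	down_read(&current->mm->mmap_sem);
	get_user_pages(current, current->mm, (unsigned long)ubuf & PAGE_MASK,
		       npages, 1 /* write */, 0 /* force */, new_pages, NULL);
	up_read(&current->mm->mmap_sem);

	for (i = 0; i < npages; i++)
		if (new_pages[i] != saved_pages[i])
			printk("page %d differs\n", i);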

Apparently some pages were discarded and the subsequent page fault brought 
in a new page. I expected the original page to be in the swap cache so that 
I would get the old page again. I repeated the experiment, this time writing 
something to all the pages before the first ioctl, but got the same results. 
I used 2.4.21-4 (RH AS 3.0).

Can anyone clarify?
thanks
Eli
