* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
       [not found] <20090311170611.GA2079@elte.hu>
@ 2009-03-11 17:33 ` Linus Torvalds
  2009-03-11 17:41   ` Ingo Molnar
  2009-03-11 18:22   ` Andrea Arcangeli
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 17:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm


On Wed, 11 Mar 2009, Ingo Molnar wrote:
> 
> FYI, in case you missed it. Large MM fix - and it's awfully late 
> in -rc7.

Yeah, I'm not taking this at this point. No way, no-how.

If there is no simpler and obvious fix, it needs to go through -stable, 
after having cooked in 2.6.30-rc for a while. Especially as this is a 
totally uninteresting usage case that I can't see as being at all relevant 
to any real world.

Anybody who mixes O_DIRECT and fork() (and threads) is already doing some 
seriously strange things. Nothing new there.

And quite frankly, the patch is so ugly as-is that I'm not likely to take 
it even into the 2.6.30 merge window unless it can be cleaned up. That 
whole fork_pre_cow function is too f*cking ugly to live. We just don't 
write code like this in the kernel.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 17:33 ` [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Linus Torvalds
@ 2009-03-11 17:41   ` Ingo Molnar
  2009-03-11 17:58     ` Linus Torvalds
  2009-03-11 18:53     ` Andrea Arcangeli
  2009-03-11 18:22   ` Andrea Arcangeli
  1 sibling, 2 replies; 83+ messages in thread
From: Ingo Molnar @ 2009-03-11 17:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > 
> > FYI, in case you missed it. Large MM fix - and it's awfully 
> > late in -rc7.
> 
> Yeah, I'm not taking this at this point. No way, no-how.
> 
> If there is no simpler and obvious fix, it needs to go through 
> -stable, after having cooked in 2.6.30-rc for a while. 
> Especially as this is a totally uninteresting usage case that 
> I can't see as being at all relevant to any real world.
> 
> Anybody who mixes O_DIRECT and fork() (and threads) is already 
> doing some seriously strange things. Nothing new there.

Hm, is there any security impact? Andrea is talking about data 
corruption. I'm wondering whether that's just corruption 
relative to whatever twisted semantics O_DIRECT has in this case 
[which would be harmless], or some true pagecache corruption 
going across COW (or other) protection domains that could be 
exploited [which would not be harmless].

	Ingo


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 17:41   ` Ingo Molnar
@ 2009-03-11 17:58     ` Linus Torvalds
  2009-03-11 18:37       ` Andrea Arcangeli
  2009-03-11 18:53     ` Andrea Arcangeli
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Hugh Dickins, Andrea Arcangeli, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Ingo Molnar wrote:
> 
> Hm, is there any security impact? Andrea is talking about data 
> corruption. I'm wondering whether that's just corruption 
> relative to whatever twisted semantics O_DIRECT has in this case 
> [which would be harmless], or some true pagecache corruption 
> going across COW (or other) protection domains that could be 
> exploited [which would not be harmless].

As far as I can tell, it's the same old problem that we've always had: if 
you fork(), it's unclear who is going to do the first write - parent or 
child (and "parent" in this case can include any number of threads that 
share the VM, of course).

And that means that anything that relies on pinned pages will never know 
whether it is pinning a page in the parent or the child - because whoever 
does the first COW of that page is the one that just gets a _copy_, not 
the original pinned page.

This isn't anything new. Anything that does anything by physical address 
will simply not do the right thing over a fork. The physical page may have 
started out as the parent's physical page, but it may end up in the end 
being the _child's_ physical page if the parent wrote to it and triggered 
the cow.

The rule has always been: don't mix fork() with page pinning. It doesn't 
work. It never worked. It likely never will.
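
In outline, the affected pattern looks like this in userspace. This is
a hypothetical sketch (made-up file name, no error handling), and the
window is timing-dependent, so it illustrates the sequence rather than
reliably reproducing the corruption:

#define _GNU_SOURCE
#include <fcntl.h>
#include <malloc.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static int fd;
static char *buf;	/* 512-byte aligned anonymous memory */

static void *reader(void *arg)
{
	/* gup pins the physical page behind buf; DMA then writes to it */
	pread(fd, buf, 512, 0);
	return NULL;
}

int main(void)
{
	pthread_t t;

	fd = open("testfile", O_RDONLY | O_DIRECT);
	buf = memalign(512, 4096);

	pthread_create(&t, NULL, reader, NULL);

	if (fork() == 0)	/* write-protects buf's page for COW */
		_exit(0);

	/*
	 * The parent touches the same page while the read is in
	 * flight: COW hands the parent a copy, but the DMA still
	 * targets the original page, now owned by the child, so the
	 * parent never sees the data pread() reported as read.
	 */
	buf[1024] = 1;

	pthread_join(t, NULL);
	wait(NULL);
	return 0;
}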

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 17:33 ` [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Linus Torvalds
  2009-03-11 17:41   ` Ingo Molnar
@ 2009-03-11 18:22   ` Andrea Arcangeli
  2009-03-11 19:06     ` Ingo Molnar
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 18:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote:
> 
> On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > 
> > FYI, in case you missed it. Large MM fix - and it's awfully late 
> > in -rc7.

I didn't specify it, but I didn't mean to submit it for immediate
inclusion. I posted it because it's ready and I wanted feedback from
Hugh/Nick/linux-mm so we can get this fixed when the next merge window
opens.

> Yeah, I'm not taking this at this point. No way, no-how.
> 
> If there is no simpler and obvious fix, it needs to go through -stable, 
> after having cooked in 2.6.30-rc for a while. Especially as this is a 
> totally uninteresting usage case that I can't see as being at all relevant 
> to any real world.

Actually AFAIK there are mission-critical real-world applications
using a 512-byte blocksize that were affected by this (I CC'ed the
relevant people, who know). However, this is a rare thing, so it
almost never triggers because the window is so small.

> Anybody who mixes O_DIRECT and fork() (and threads) is already doing some 
> seriously strange things. Nothing new there.

Most apps aren't affected, of course. But almost all apps eventually
call fork (system/fork/exec/anything). Calling fork is currently
enough to generate memory corruption in the parent (i.e. lost O_DIRECT
reads from disk).

> And quite frankly, the patch is so ugly as-is that I'm not likely to take 
> it even into the 2.6.30 merge window unless it can be cleaned up. That 
> whole fork_pre_cow function is too f*cking ugly to live. We just don't 
> write code like this in the kernel.

Yes, this is exactly why I posted it now, to get feedback; it wasn't
meant for submission. Feel free to write it yourself in another way, of
course; I included all the relevant testcases to test alternate fixes
too.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 17:58     ` Linus Torvalds
@ 2009-03-11 18:37       ` Andrea Arcangeli
  2009-03-11 18:46         ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> As far as I can tell, it's the same old problem that we've always had: if 
> you fork(), it's unclear who is going to do the first write - parent or 
> child (and "parent" in this case can include any number of threads that 
> share the VM, of course).

The child doesn't touch any page. Calling fork just generates O_DIRECT
corruption in the parent regardless of what the child does.

> This isn't anything new. Anything that does anything by physical address 

This is also nothing new in the sense that all Linux kernels out there
have had this bug so far.

> will simply not do the right thing over a fork. The physical page may have 
> started out as the parent's physical page, but it may end up in the end 
> being the _child's_ physical page if the parent wrote to it and triggered 
> the cow.

Actually the child will get corrupted too, not just the parent (by
losing the O_DIRECT reads). The child always assumes its anon page
contents will not get lost or overwritten after it changes them.

> The rule has always been: don't mix fork() with page pinning. It doesn't 
> work. It never worked. It likely never will.

I never heard this rule here, but surely I agree there will not be
many apps out there capable of triggering this. Mostly because most
apps use O_DIRECT on top of shm (surely not because they're not
usually calling fork). The ones affected are the ones using anonymous
memory with threads and not allocating memory with memalign(4096)
even though they use a 512-byte blocksize for their I/O. If they use
threads and they allocate with memalign(512), they can be affected if
they call fork anywhere.

I don't think it's an urgent fix, but if you're now claiming that this
doesn't ever need fixing and we can live with the bug forever, I think
you're wrong. If anything, I'd rather see O_DIRECT stop supporting the
hard blocksize and accept only PAGE_SIZE multiples; that would at
least limit the breakage to an undefined behavior.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 18:37       ` Andrea Arcangeli
@ 2009-03-11 18:46         ` Linus Torvalds
  2009-03-11 19:01           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 18:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> > As far as I can tell, it's the same old problem that we've always had: if 
> > you fork(), it's unclear who is going to do the first write - parent or 
> > child (and "parent" in this case can include any number of threads that 
> > share the VM, of course).
> 
> The child doesn't touch any page. Calling fork just generates O_DIRECT
> corruption in the parent regardless of what the child does.

You aren't listening.

It depends on who does the write. If the _parent_ does the write (with 
another thread or not), then the _parent_ gets the COW.

That's all I said.

> > The rule has always been: don't mix fork() with page pinning. It doesn't 
> > work. It never worked. It likely never will.
> 
> I never heard this rule here

It's never been written down, but it's obvious to anybody who looks at how 
COW works for even five seconds. The fact is, the person doing the COW 
after a fork() is the person who no longer has the same physical page 
(because he got a new page).

So _anything_ that depends on physical addresses simply _cannot_ work 
concurrently with a fork. That has always been true.

If the idiots who use O_DIRECT don't understand that, then hey, it's their 
problem. I have long been of the opinion that we should not support 
O_DIRECT at all, and that it's a totally broken premise to start with. 

This is just one of millions of reasons.

		Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 17:41   ` Ingo Molnar
  2009-03-11 17:58     ` Linus Torvalds
@ 2009-03-11 18:53     ` Andrea Arcangeli
  1 sibling, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 18:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

Hello,

On Wed, Mar 11, 2009 at 06:41:03PM +0100, Ingo Molnar wrote:
> Hm, is there any security impact? Andrea is talking about data 
> corruption. I'm wondering whether that's just corruption 
> relative to whatever twisted semantics O_DIRECT has in this case 
> [which would be harmless], or some true pagecache corruption 

I don't think it's exploitable and I don't see this much as a security
issue. This can only corrupt user data inside anonymous pages (not
filesystem metadata or kernel pagecache). Side effects will be the
usual ones of random user memory corruption or, at worst, it can lead
to I/O corruption on disk, but only in user data.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 18:46         ` Linus Torvalds
@ 2009-03-11 19:01           ` Linus Torvalds
  2009-03-11 19:59             ` Andrea Arcangeli
  2009-03-11 19:06           ` Andrea Arcangeli
  2009-03-12  5:36           ` Nick Piggin
  2 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 19:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> It's never been written down, but it's obvious to anybody who looks at how 
> COW works for even five seconds. The fact is, the person doing the COW 
> after a fork() is the person who no longer has the same physical page 
> (because he got a new page).

Btw, I think your patch has a race. Admittedly a really small one.

When you look up the page in gup.c, and then set the GUP flag on the 
"struct page", in between the lookup and the setting of the flag, another 
thread can come in and do that same fork+write thing.

	CPU0:			CPU1

	gup:			fork:
	 - look up page
	 - it's read-write
	...
				set_wr_protect
				test GUP bit - not set, good
				done

	- Mark it GUP
				tlb_flush

				write to it from user space - COW

since there is no locking on the GUP side (there's the TLB flush that will 
wait for interrupts being enabled again on CPU0, but that's later in the 
fork sequence).

Maybe I'm missing something. The race is certainly very unlikely to ever 
happen in practice, but it looks real.

Also, having to set the PG_GUP bit means that the "fast" gup is likely not 
much faster than the slow one. It now has two atomics per page it looks 
up, afaik, which sounds like it would delete any advantage it had over the 
slow version that needed locking.

What we _could_ try to do is to always make the COW breaking be a 
_directed_ event - we'd make sure that we always break COW in the 
direction of the first owner (going to the rmap chains). That might solve 
everything, and be purely local to the logic in mm/memory.c (do_wp_page).

I dunno. I have not looked at how horrible that would be.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 18:22   ` Andrea Arcangeli
@ 2009-03-11 19:06     ` Ingo Molnar
  2009-03-11 19:15       ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2009-03-11 19:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm


* Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Mar 11, 2009 at 10:33:00AM -0700, Linus Torvalds wrote:
> > 
> > On Wed, 11 Mar 2009, Ingo Molnar wrote:
> > > 
> > > FYI, in case you missed it. Large MM fix - and it's awfully late 
> > > in -rc7.
> 
> I didn't specify it, but I didn't mean to submit it for 
> immediate inclusion. I posted it because it's ready and I 
> wanted feedback from Hugh/Nick/linux-mm so we can get this 
> fixed when the next merge window opens.

Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a 
fresh feature. If the problem existed in earlier kernels too 
then i guess it isnt urgent.

	Ingo


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 18:46         ` Linus Torvalds
  2009-03-11 19:01           ` Linus Torvalds
@ 2009-03-11 19:06           ` Andrea Arcangeli
  2009-03-12  5:36           ` Nick Piggin
  2 siblings, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 19:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 11:46:17AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> > On Wed, Mar 11, 2009 at 10:58:17AM -0700, Linus Torvalds wrote:
> > > As far as I can tell, it's the same old problem that we've always had: if 
> > > you fork(), it's unclear who is going to do the first write - parent or 
> > > child (and "parent" in this case can include any number of threads that 
> > > share the VM, of course).
> > 
> > The child doesn't touch any page. Calling fork just generates O_DIRECT
> > corruption in the parent regardless of what the child does.
> 
> You aren't listening.
> 
> It depends on who does the write. If the _parent_ does the write (with 
> another thread or not), then the _parent_ gets the COW.
> 
> That's all I said.

I only wanted to clarify this doesn't require the child to touch the
page at all.

> If the idiots who use O_DIRECT don't understand that, then hey, it's their 
> problem. I have long been of the opinion that we should not support 
> O_DIRECT at all, and that it's a totally broken premise to start with. 

Well, if you don't like it being used by databases, O_DIRECT is still
ideal for KVM. The guest cache runs at CPU core speed, unlike the host
cache. Not that KVM can reproduce this bug (all RAM where KVM would be
doing O_DIRECT is mapped MADV_DONTFORK, and besides, guest physical
RAM has to be allocated with memalign(4096) ;).

That said, I agree it'd be better to nuke O_DIRECT than to leave this
bug, as O_DIRECT should not break the usual memory-protection
semantics provided by the read() and fork() syscalls.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 19:06     ` Ingo Molnar
@ 2009-03-11 19:15       ` Andrea Arcangeli
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 19:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 08:06:55PM +0100, Ingo Molnar wrote:
> Good - i saw the '(fast-)gup fix' qualifier and fast-gup is a 
> fresh feature. If the problem existed in earlier kernels too 
> then i guess it isnt urgent.

It always existed, yes. The reason for the (fast-) qualifier is that
gup-fast made it harder to fix this in mainline (there is also a patch
floating around for 2.6.18-based kernels that is simpler thanks to
gup-fast not being there). The trouble with gup-fast is that doing the
check of page_count inside the PT lock (or mmap_sem in write mode like
in fork(); ksm only takes mmap_sem in read mode and relied on the PT
lock alone) wasn't enough anymore to be sure the page_count wouldn't
increase from under us just after we read it, because a gup-fast could
be running on another CPU without mmap_sem and without the PT lock
taken. So fixing this in mainline has been a bit harder, as I had to
prevent gup-fast from going ahead in the fast path, in a way that
didn't send IPIs to flush the smp-tlb before reading the page_count
(so as to avoid sending IPIs for every anon page mapped writeable).
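
Schematically (a simplified sketch, not the actual mm code; the
expected_refs() helper is made up to stand for the pin test):

spin_lock(ptl);
/*
 * Pre-gup-fast this test was race-free: every gup path held
 * mmap_sem plus the PT lock, so page_count couldn't change here.
 */
if (page_count(page) == expected_refs(page))	/* hypothetical helper */
	ptep_set_wrprotect(src_mm, addr, src_pte);  /* share read-only */
else
	/* extra pins: copy the page for the child instead */
	fork_pre_cow(dst_mm, src_mm, vma, addr, ...);
spin_unlock(ptl);
/*
 * With gup-fast, another CPU can get_page() with no mmap_sem and no
 * PT lock (only local irqs disabled), so the count can grow right
 * after it was read and the check is no longer conclusive.
 */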


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 19:01           ` Linus Torvalds
@ 2009-03-11 19:59             ` Andrea Arcangeli
  2009-03-11 20:19               ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 12:01:56PM -0700, Linus Torvalds wrote:
> Btw, I think your patch has a race. Admittedly a really small one.
> 
> When you look up the page in gup.c, and then set the GUP flag on the 
> "struct page", in between the lookup and the setting of the flag, another 
> thread can come in and do that same fork+write thing.
> 
> 	CPU0:			CPU1
> 
> 	gup:			fork:
> 	 - look up page
> 	 - it's read-write
> 	...
> 				set_wr_protect
> 				test GUP bit - not set, good
> 				done
> 
> 	- Mark it GUP
> 				tlb_flush
> 
> 				write to it from user space - COW

Did you notice the check after 'mark it gup' that will run in CPU0?

+		if (PageAnon(page)) {
+			if (!PageGUP(page))
+				SetPageGUP(page);
+			smp_mb();
+			/*
+			 * Fork doesn't want to flush the smp-tlb for
+			 * every pte that it marks readonly but newly
+			 * created shared anon pages cannot have
+			 * direct-io going to them, so check if fork
+			 * made the page shared before we took the
+			 * page pin.
+			 */
+			if ((pte_flags(gup_get_pte(ptep)) &
+			     (mask | _PAGE_SPECIAL)) != mask) {
+				put_page(page);
+				pte_unmap(ptep);
+				return 0;
+			}
+		}

gup-fast will _not_ succeed because of the set_wr_protect that just
happened on CPU1. That's why I added the above check after setpagegup/get_page.

> since there is no locking on the GUP side (there's the TLB flush that will 
> wait for interrupts being enabled again on CPU0, but that's later in the 
> fork sequence).

Right, I preferred to 'recheck' the wrprotect bit before allowing
gup-fast to succeed, to avoid sending a flood of IPIs in the fork fast
path. So I leave the tlb flush at the end of the fork sequence and a
single IPI in the common case.

The only exception is the forcecow path, where the copy has to happen
atomically per-page, so I have to flush the smp-tlb before the copy
after marking the parent wrprotected temporarily (later the parent pte
is marked read-write again by fork_pre_cow after the copy), or NPTL
will never have a chance to fix its bug, as its glibc-parent data
structures that could be modified by threads won't be copied
atomically to the child. But that's a slow path, so it's ok to flush
the tlb there.

> Also, having to set the PG_GUP bit means that the "fast" gup is likely not 
> much faster than the slow one. It now has two atomics per page it looks 
> up, afaik, which sounds like it would delete any advantage it had over the 
> slow version that needed locking.

gup-fast already has to get_page, so I don't see it. gup-fast will
always dirty that cacheline and take it over regardless of PG_gup;
gup-fast will never be able to run without running get_page.
Furthermore, starting from the second access GUP is already set and
it's only an L1 read from a cacheline that was already dirtied and
taken over a few instructions before. So I think it can't be slowing
down gup-fast in any measurable way, given how close after get_page
mark-gup is set.

> What we _could_ try to do is to always make the COW breaking be a 
> _directed_ event - we'd make sure that we always break COW in the 
> direction of the first owner (going to the rmap chains). That might solve 
> everything, and be purely local to the logic in mm/memory.c (do_wp_page).

That's a really interesting idea and frankly I didn't think about it.
Probably one reason is that it can't work for ksm, where we take two
random anon pages and create one out of them, so each one could
already have O_DIRECT in progress on it, and we have to prevent pages
that have in-flight O_DIRECT from being merged no matter what
(ordering is irrelevant for ksm; page contents must be stable or ksm
will break). I was thinking of using the same logic for both ksm and
fork.

But theoretically, ksm can keep doing the page_count check to truly
ensure no in-flight I/O is going on, and fork could fix it in whatever
way it wants (I wonder if it'd be ok for fork to map a 'changing' page
in the child, given the undefined behavior of forking while a read is
in progress; at least at the first write the page would stop changing
contents). In fact ksm doesn't even require the above change to
gup-fast, because it does ptep_clear_flush_notify when it tries to
wrprotect a not-shared anon page.

> I dunno. I have not looked at how horrible that would be.

For fork I think it would work; I'm not sure the current data
structures would be enough, but at first glance, besides how horrible
that would be, I think from a practical standpoint the main problem is
the slowdown it'd generate in the do_wp_page fast path. The anon_vma
list can be huge in some weird cases, which we normally couldn't care
less about, as swap algorithms and disk I/O (even on non-seeking SSDs)
are even slower than that. The coolness of rmap w/o pte_chains is that
rmap is zero-cost for all page faults (a check on vma->anon_vma being
non-null is the only cost) and I'd like to keep it that way.

The cost of my fix to fork is not measurable with a fork
microbenchmark, while the cost of finding who owns the original shared
page in do_wp_page would potentially be much bigger. The only slowdown
to fork is in the O_DIRECT slow path, which we don't care about and in
the worst case is limited to the total amount of in-flight I/O.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 19:59             ` Andrea Arcangeli
@ 2009-03-11 20:19               ` Linus Torvalds
  2009-03-11 20:33                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 20:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> Did you notice the check after 'mark it gup' that will run in CPU0?

Ahh, no. I just read the patch through fairly quickly, and the whole 
"(gup_get_pte & mask) != mask" didn't trigger as obvious. But yeah, I see 
that it ends up re-checking the RW bit.

> gup-fast will _not_ succeed because of the set_wr_protect that just 
> happened on CPU1. That's why I added the above check after 
> setpagegup/get_page.

Ok, with the recheck I think it's fine.

> > Also, having to set the PG_GUP bit means that the "fast" gup is likely not 
> > much faster than the slow one. It now has two atomics per page it looks 
> > up, afaik, which sounds like it would delete any advantage it had over the 
> > slow version that needed locking.
> 
> gup-fast already has to get_page, so I don't see it.

That's my point. It used to have one atomic. Now it has two (and a memory 
barrier). Those tend to be pretty expensive - even when there's no 
cacheline bouncing.

> Furthermore starting from the second access GUP is already
> set

That's a totally bogus argument. It will be true for _benchmarks_, but if 
somebody is trying to avoid buffered IO, one very possible common case is 
that it's all going to be new pages all the time.

That said, I don't know who the crazy O_DIRECT users are. It may be true 
that some O_DIRECT users end up using the same pages over and over again, 
and that this is a good optimization for them.

> > What we _could_ try to do is to always make the COW breaking be a 
> > _directed_ event - we'd make sure that we always break COW in the 
> > direction of the first owner (going to the rmap chains). That might solve 
> > everything, and be purely local to the logic in mm/memory.c (do_wp_page).
> 
> That's a really interesting idea and frankly I didn't think about it.

The advantage of it is that it fixes the problem not just in one place, 
but "forever". No hacks about exactly how you access the mappings etc.

Of course, nothing _really_ solves things. If you do some delayed IO after 
having looked up the mapping and turned it into a physical page, and the 
original allocator actually unmaps it (or exits), then the same issue can 
still happen (well, not the _same_ one - but the very similar issue of the 
child seeing changes even though the IO was started in the parent). 

This is why I think any "look up by physical" is fundamentally flawed. It 
very basically becomes a "I have a secret local TLB that cannot be changed 
or flushed". And any single-bit solution (GUP) is always going to be 
fairly broken. 

> The cost of my fix to fork is not measurable with a fork
> microbenchmark, while the cost of finding who owns the original shared
> page in do_wp_page would potentially be much bigger. The only slowdown
> to fork is in the O_DIRECT slow path, which we don't care about and in
> the worst case is limited to the total amount of in-flight I/O.

Agreed. However, I really think this is an O_DIRECT problem. Just document 
it. Tell people that O_DIRECT simply doesn't work with COW, and 
fundamentally can never work well.

If you use O_DIRECT with threading, you had better know what the hell 
you're doing anyway. I do not think that the kernel should do stupid 
things just because stupid users don't understand the semantics of the 
_non-stupid_ thing (which is to just let people think about COW for five 
seconds).

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:19               ` Linus Torvalds
@ 2009-03-11 20:33                 ` Linus Torvalds
  2009-03-11 20:55                   ` Andrea Arcangeli
  2009-03-14  5:07                   ` Benjamin Herrenschmidt
  2009-03-11 20:48                 ` Andrea Arcangeli
  2009-03-14  5:06                 ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 20:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> Agreed. However, I really think this is an O_DIRECT problem. Just document 
> it. Tell people that O_DIRECT simply doesn't work with COW, and 
> fundamentally can never work well.
> 
> If you use O_DIRECT with threading, you had better know what the hell 
> you're doing anyway. I do not think that the kernel should do stupid 
> things just because stupid users don't understand the semantics of the 
> _non-stupid_ thing (which is to just let people think about COW for five 
> seconds).

Btw, if we don't do that, then there are better alternatives. One is:

 - fork already always takes the write lock on mmap_sem (and f*ck no, I 
   doubt anybody will ever care one whit how "parallel" you can do forks 
   from threads, so I don't think this is an issue)

 - Just make the rule be that people who use get_user_pages() always 
   have to have the read-lock on mmap_sem until they've used the pages.

We already take the read-lock for the lookup (well, not for the gup, but 
for all the slow cases), but I'm saying that we could go one step further 
- just read-lock over the _whole_ O_DIRECT read or write. That way you 
literally protect against concurrent fork()s.
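
The rough shape would be something like this (a hypothetical sketch,
not actual fs/direct-io.c code; arguments are elided):

ssize_t odirect_rw(struct file *file, ...)
{
	struct address_space *mapping = file->f_mapping;
	struct mm_struct *mm = current->mm;
	ssize_t ret;

	/* fork() takes mmap_sem for writing, so it has to wait */
	down_read(&mm->mmap_sem);
	ret = mapping->a_ops->direct_IO(...);	/* gup + DMA + completion */
	up_read(&mm->mmap_sem);
	return ret;
}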

		Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:19               ` Linus Torvalds
  2009-03-11 20:33                 ` Linus Torvalds
@ 2009-03-11 20:48                 ` Andrea Arcangeli
  2009-03-14  5:06                 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 01:19:03PM -0700, Linus Torvalds wrote:
> That said, I don't know who the crazy O_DIRECT users are. It may be true 
> that some O_DIRECT users end up using the same pages over and over again, 
> and that this is a good optimization for them.

If it's done on new pages, chances are the gup-fast fast path can't
run in the first place, modulo glibc memalign reusing previously
freed areas. Overall I think it's a worthwhile optimization, to avoid
the locked op in the rewrite case, which I think is common enough.

But I totally agree that it'd be good to benchmark gup-fast on
already-instantiated ptes where SetPageGUP will run. I thought it'd be
below measurement error and not measurable, but it's good to check.

> The advantage of it is that it fixes the problem not just in one place, 
> but "forever". No hacks about exactly how you access the mappings etc.
> 
> Of course, nothing _really_ solves things. If you do some delayed IO after 
> having looked up the mapping and turned it into a physical page, and the 
> original allocator actually unmaps it (or exits), then the same issue can 
> still happen (well, not the _same_ one - but the very similar issue of the 
> child seeing changes even though the IO was started in the parent). 
> 
> This is why I think any "look up by physical" is fundamentally flawed. It 
> very basically becomes a "I have a secret local TLB that cannot be changed 
> or flushed". And any single-bit solution (GUP) is always going to be 
> fairly broken. 

One of the reasons for not sharing when PG_gup is set and page_count
shows the page as pinned is also to fix all sorts of drivers that do
gup to "look up by physical" on anon pages, do "DMA by physical to
some offset of the page" at any time later, and fork. Otherwise
PageReserved should be set by default by gup-fast, instead of relying
on the drivers to set it after gup-fast returns.

> Agreed. However, I really think this is a O_DIRECT problem. Just document 
> it. Tell people that O_DIRECT simply doesn't work with COW, and 
> fundamentally can never work well.
>
> If you use O_DIRECT with threading, you had better know what the hell 
> you're doing anyway. I do not think that the kernel should do stupid 
> things just because stupid users don't understand the semantics of the 
> _non-stupid_ thing (which is to just let people think about COW for five 
> seconds).

This really isn't only about O_DIRECT. This is to fix gup vs fork;
O_DIRECT is just one of the millions of gup users out there... KVM
works around this by using MADV_DONTFORK. Before MADV_DONTFORK was
introduced, I once started to get corruption in KVM when a change made
system() be executed once in a while for whatever unrelated reason.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:33                 ` Linus Torvalds
@ 2009-03-11 20:55                   ` Andrea Arcangeli
  2009-03-11 21:28                     ` Linus Torvalds
  2009-03-14  5:07                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> Btw, if we don't do that, then there are better alternatives. One is:
> 
>  - fork already always takes the write lock on mmap_sem (and f*ck no, I 
>    doubt anybody will ever care one whit how "parallel" you can do forks 
>    from threads, so I don't think this is an issue)
> 
>  - Just make the rule be that people who use get_user_pages() always 
>    have to have the read-lock on mmap_sem until they've used the pages.

How do you handle pages where gup has already returned and I/O is
still in flight? Forcing gup-fast to be called with mmap_sem already
held (like gup used to require) only avoids the need for changes in
gup-fast, AFAICT. You'll still get pages that are pinned, and calling
gup-fast under mmap_sem (no matter if in read or even write mode)
won't make a difference: those pages will still be pinned while fork
runs, with DMA going to them (by O_DIRECT or some driver using gup, as
long as PageReserved isn't set on them).

> We already take the read-lock for the lookup (well, not for the gup, but 
> for all the slow cases), but I'm saying that we could go one step further 
> - just read-lock over the _whole_ O_DIRECT read or write. That way you 
> literally protect against concurrent fork()s.

Releasing the mmap_sem read mode in the irq-completion handler context
should be possible; however, fork will end up throttled, blocking for
I/O, which isn't very nice behavior. BTW, direct-io.c is a total mess;
I couldn't even figure out where to release those locks in the I/O
completion handlers when I tried something like this with PG_lock
instead of the mmap_sem...  Eventually I gave up because this isn't
just about O_DIRECT: all gup users have this trouble with fork.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:55                   ` Andrea Arcangeli
@ 2009-03-11 21:28                     ` Linus Torvalds
  2009-03-11 21:57                       ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 21:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> On Wed, Mar 11, 2009 at 01:33:17PM -0700, Linus Torvalds wrote:
> > Btw, if we don't do that, then there are better alternatives. One is:
> > 
> >  - fork already always takes the write lock on mmap_sem (and f*ck no, I 
> >    doubt anybody will ever care one whit how "parallel" you can do forks 
> >    from threads, so I don't think this is an issue)
> > 
> >  - Just make the rule be that people who use get_user_pages() always 
> >    have to have the read-lock on mmap_sem until they've used the pages.
> 
> How do you handle pages where gup has already returned and I/O is
> still in flight?

The rule is:
 - either keep the mmap_sem for reading until the IO is done
 - admit the fact that IO is asynchronous, and has visible async behavior.

> Forcing gup-fast to be called with mmap_sem already
> held (like gup used to require) only avoids the need for changes in
> gup-fast, AFAICT. You'll still get pages that are pinned, and calling
> gup-fast under mmap_sem (no matter if in read or even write mode)
> won't make a difference: those pages will still be pinned while fork
> runs, with DMA going to them (by O_DIRECT or some driver using gup, as
> long as PageReserved isn't set on them).

The point I'm trying to make is that anybody who thinks that pages are 
stable over various behavior that runs in another thread - be it a fork, 
an mmap/munmap, or anything else - is just fooling themselves. The pages 
are going to show up in "random" places. 

The fact that the non-fast "get_user_pages()" takes the mmap semaphore for 
reading doesn't even protect that. It just means that the pages made sense 
at the time the get_user_pages() happened, not necessarily at the time 
when the actual use of them did. 

> Releasing the mmap_sem read mode in the irq-completion handler context
> should be possible; however, fork will end up throttled, blocking for
> I/O, which isn't very nice behavior. BTW, direct-io.c is a total mess;
> I couldn't even figure out where to release those locks in the I/O
> completion handlers when I tried something like this with PG_lock
> instead of the mmap_sem...  Eventually I gave up because this isn't
> just about O_DIRECT: all gup users have this trouble with fork.

O_DIRECT is actually the _simple_ case, since we won't be returning until 
it is done (ie it's not actually an async interface). So no, O_DIRECT 
doesn't need any interrupt handler games. It would just need to hold the 
sem over the actual call to the filesystem (ie just over the ->direct_IO() 
call).

Of course, I suspect that all users of O_DIRECT would be _very_ unhappy if 
they cannot do mmap/unmap/brk on other areas while O_DIRECT is going on, 
so it's almost certainly not reasonable.

People want the relaxed synchronization we give them, and that's literally 
why get_user_pages_fast exists - because people don't want _more_ 
synchronization, they want _less_.

But the thing is, with less synchronization, the behavior really is 
surprising in the edge cases. Which is why I think "threaded fork" plus 
"get_user_pages_fast" just doesn't make sense to even _worry_ about. If 
you use O_DIRECT and mix it with fork, you get what you get, and it's 
random - exactly because people who want O_DIRECT don't want any locking. 

It's a user-space issue, not a kernel issue.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 21:28                     ` Linus Torvalds
@ 2009-03-11 21:57                       ` Andrea Arcangeli
  2009-03-11 22:06                         ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-11 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm

On Wed, Mar 11, 2009 at 02:28:08PM -0700, Linus Torvalds wrote:
> The fact that the non-fast "get_user_pages()" takes the mmap semaphore for 
> reading doesn't even protect that. It just means that the pages made sense 
> at the time the get_user_pages() happened, not necessarily at the time 
> when the actual use of them did. 

Indeed this is a generic problem, not specific to
get_user_pages_fast. get_user_pages_fast just adds a few complications
to serialize against.

> O_DIRECT is actually the _simple_ case, since we won't be returning until 
> it is done (ie it's not actually a async interface). So no, O_DIRECT 
> doesn't need any interrupt handler games. It would just need to hold the 
> sem over the actual call to the filesystem (ie just over the ->direct_IO() 
> call).

I don't see how you can solve the race by holding the sem only over
the direct_IO call (and not until the I/O completion handler fires). I
think that to solve the race using mmap_sem only, the bio I/O
completion handler that eventually calls into direct-io.c from irq
context would need to up_read(&mmap_sem).

The way my patch avoids altering the I/O completion path running from
irq context is by ensuring no I/O is going on at all to the pages that
are being shared with the child, and by ensuring that any gup or
gup-fast will trigger cow before it can write to the shared
page. Pages simply can't be shared before I/O is complete.

> People want the relaxed synchronization we give them, and that's literally 
> why get_user_pages_fast exists - because people don't want _more_ 
> synchronization, they want _less_.
> 
> But the thing is, with less synchronization, the behavior really is 
> surprising in the edge cases. Which is why I think "threaded fork" plus 
> "get_user_pages_fast" just doesn't make sense to even _worry_ about. If 
> you use O_DIRECT and mix it with fork, you get what you get, and it's 
> random - exactly because people who want O_DIRECT don't want any locking. 
> 
> It's a user-space issue, not a kernel issue.

I think your point of view is clear. I sure can write userland code
that copes with the currently altered memory-protection semantics of
read vs fork when the fd is opened with O_DIRECT or drivers are using
gup, so I'll let the userland folks comment on it; some are in CC.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 21:57                       ` Andrea Arcangeli
@ 2009-03-11 22:06                         ` Linus Torvalds
  2009-03-11 22:07                           ` Linus Torvalds
  2009-03-11 22:22                           ` Davide Libenzi
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 22:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Andrea Arcangeli wrote:
> 
> > People want the relaxed synchronization we give them, and that's literally 
> > why get_user_pages_fast exists - because people don't want _more_ 
> > synchronization, they want _less_.
> > 
> > But the thing is, with less synchronization, the behavior really is 
> > surprising in the edge cases. Which is why I think "threaded fork" plus 
> > "get_user_pages_fast" just doesn't make sense to even _worry_ about. If 
> > you use O_DIRECT and mix it with fork, you get what you get, and it's 
> > random - exactly because people who want O_DIRECT don't want any locking. 
> > 
> > It's a user-space issue, not a kernel issue.
> 
> I think your point of view is clear. I sure can write userland code
> that copes with the currently altered memory-protection semantics of
> read vs fork when the fd is opened with O_DIRECT or drivers are using
> gup, so I'll let the userland folks comment on it; some are in CC.

Btw, we could make it easier for people to not screw up.

In particular, "fork()" in a threaded program is almost always wrong. If 
you want to exec another program from a threaded one, you should either 
just do execve() (which kills all threads) or you should do vfork+execve 
(which has none of the COW issues).

And we could add a warning for it. Something like "if this is a threaded 
program, and it has ever used get_user_pages(), and it does a fork(), warn 
about it once". Maybe people would realize what a stupid thing they are 
doing, and that there is a simple fix (vfork).
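
For the exec case the pattern is minimal (a userspace sketch):

#include <sys/types.h>
#include <unistd.h>

/*
 * vfork() shares the parent's address space - no COW pages are
 * created - until the child calls execve() or _exit(), so pages
 * pinned in the parent stay the parent's.
 */
pid_t spawn(const char *path, char *const argv[], char *const envp[])
{
	pid_t pid = vfork();

	if (pid == 0) {
		execve(path, argv, envp);
		_exit(127);	/* exec failed; nothing else is safe here */
	}
	return pid;	/* -1 if vfork itself failed */
}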

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 22:06                         ` Linus Torvalds
@ 2009-03-11 22:07                           ` Linus Torvalds
  2009-03-11 22:22                           ` Davide Libenzi
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 22:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Linus Torvalds wrote:
> 
> And we could add a warning for it. Something like "if this is a threaded 
> program, and it has ever used get_user_pages(), and it does a fork(), warn 
> about it once". Maybe people would realize what a stupid thing they are 
> doing, and that there is a simple fix (vfork).

Ehh. vfork is only simple if you literally are going to execve. If you are 
using a fork as some kind of odd way to snapshot, I don't know what you 
should do. You can't sanely snapshot a threaded app with fork, but I bet 
some people try.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 22:06                         ` Linus Torvalds
  2009-03-11 22:07                           ` Linus Torvalds
@ 2009-03-11 22:22                           ` Davide Libenzi
  2009-03-11 22:32                             ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Davide Libenzi @ 2009-03-11 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 11 Mar 2009, Linus Torvalds wrote:

> In particular, "fork()" in a threaded program is almost always wrong. If 
> you want to exec another program from a threaded one, you should either 
> just do execve() (which kills all threads) or you should do vfork+execve 
> (which has none of the COW issues).

Didn't follow the lengthy thread, but if we make fork+exec fail inside 
a threaded program, we might end up making a lot of people unhappy.


- Davide



* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 22:22                           ` Davide Libenzi
@ 2009-03-11 22:32                             ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-11 22:32 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm



On Wed, 11 Mar 2009, Davide Libenzi wrote:
> 
> Didn't follow the lengthy thread, but if we make fork+exec fail inside 
> a threaded program, we might end up making a lot of people unhappy.

Yeah, no, we don't want to fail it, but we could do a one-time warning or 
something, to at least see who does it and perhaps see if some of them 
might realize the problems.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 18:46         ` Linus Torvalds
  2009-03-11 19:01           ` Linus Torvalds
  2009-03-11 19:06           ` Andrea Arcangeli
@ 2009-03-12  5:36           ` Nick Piggin
  2009-03-12 16:23             ` Nick Piggin
  2 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-12  5:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Thursday 12 March 2009 05:46:17 Linus Torvalds wrote:
> On Wed, 11 Mar 2009, Andrea Arcangeli wrote:

> > > The rule has always been: don't mix fork() with page pinning. It
> > > doesn't work. It never worked. It likely never will.
> >
> > I never heard this rule here
>
> It's never been written down, but it's obvious to anybody who looks at how
> COW works for even five seconds. The fact is, the person doing the COW
> after a fork() is the person who no longer has the same physical page
> (because he got a new page).
>
> So _anything_ that depends on physical addresses simply _cannot_ work
> concurrently with a fork. That has always been true.
>
> If the idiots who use O_DIRECT don't understand that, then hey, it's their
> problem. I have long been of the opinion that we should not support
> O_DIRECT at all, and that it's a totally broken premise to start with.
>
> This is just one of millions of reasons.

Well, it is a quite well-known issue at this stage, I think. We've had
MADV_DONTFORK since 2.6.16, which exists basically to solve this
issue, I think for the infiniband library. I guess if it would be
really helpful we *could* add MADV_DONTCOW.
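
The existing escape hatch is a single call on the buffer before the
zero-copy I/O starts (sketch; buf/len stand for the gup target):

#include <sys/mman.h>

/*
 * fork() will leave this range out of the child entirely, so no COW
 * can move the pinned page out from under the DMA.
 */
if (madvise(buf, len, MADV_DONTFORK) != 0)
	perror("madvise(MADV_DONTFORK)");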

Assuming we want to try fixing it transparently... what about another
approach, mark a vma as VM_DONTCOW and uncow all existing pages in it
if it ever has get_user_pages run on it. Big hammer approach.

fast gup would be a little bit harder because looking up the vma
defeats the purpose. However if we use another page bit to say the
page belongs to a VM_DONTCOW vma, then we only need to check that
once and fall back to slow gup if it is clear. So there would be no
extra atomics in the repeat case. Yes it would be slower, but apps
that really care should know what they are doing and set
MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the
zero copy IO.

Would this work? Anyone see any holes? (I imagine someone might argue
against big hammer, but I would prefer it if it is lighter impact on
the VM and still allows good applications to avoid the hammer)


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12  5:36           ` Nick Piggin
@ 2009-03-12 16:23             ` Nick Piggin
  2009-03-12 17:00               ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 16:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Thursday 12 March 2009 16:36:18 Nick Piggin wrote:

> Assuming we want to try fixing it transparently... what about another
> approach, mark a vma as VM_DONTCOW and uncow all existing pages in it
> if it ever has get_user_pages run on it. Big hammer approach.
>
> fast gup would be a little bit harder because looking up the vma
> defeats the purpose. However if we use another page bit to say the
> page belongs to a VM_DONTCOW vma, then we only need to check that
> once and fall back to slow gup if it is clear. So there would be no
> extra atomics in the repeat case. Yes it would be slower, but apps
> that really care should know what they are doing and set
> MADV_DONTFORK or MADV_DONTCOW on the vma by hand before doing the
> zero copy IO.
>
> Would this work? Anyone see any holes? (I imagine someone might argue
> against big hammer, but I would prefer it if it is lighter impact on
> the VM and still allows good applications to avoid the hammer)

OK, this is as far as I got tonight.

This passes Andrea's dma_thread test case. I haven't started on
hugepages, and it isn't quite right to drop the mmap_sem and retake it
for write in get_user_pages (firstly, the caller might hold mmap_sem
for write; secondly, it may not be able to tolerate mmap_sem being
dropped).
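
(For readers without the original attachment, the shape of the race the
test exercises is roughly the following; this is a hypothetical sketch,
not Andrea's actual dma_thread.c:)

#define _GNU_SOURCE		/* for O_DIRECT */
#include <pthread.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static char *buf;			/* page-aligned O_DIRECT target */

static void *reader(void *arg)
{
	int fd = open("datafile", O_RDONLY | O_DIRECT);
	read(fd, buf, 1 << 16);		/* DMA in flight across the fork */
	return NULL;
}

static void *dirtier(void *arg)
{
	buf[0] = 1;			/* write triggers COW in the parent */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	posix_memalign((void **)&buf, 4096, 1 << 16);
	pthread_create(&t1, NULL, reader, NULL);
	if (fork() == 0)		/* fork wrprotects the parent's ptes */
		_exit(0);
	pthread_create(&t2, NULL, dirtier, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	/* compare buf with the file here: on a buggy kernel the DMA can
	 * land in the pre-COW page and the parent sees stale data */
	return 0;
}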

Annoying that it has to take mmap_sem for write to add this bit to
vm_flags. Possibly we could use a different way to signal it is a
"dontcow" vma... something in anon_vma maybe?

Anyway, before worrying too much more about those details, I'll post
it. It is a different approach that I think might be worth consideration.
Comments?

Thanks,
Nick
--
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-03-13 03:00:58.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2009-03-13 03:05:00.000000000 +1100
@@ -104,6 +104,7 @@ extern unsigned int kobjsize(const void 
 #define VM_CAN_NONLINEAR 0x08000000	/* Has ->fault & does nonlinear pages */
 #define VM_MIXEDMAP	0x10000000	/* Can contain "struct page" and pure PFN pages */
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
+#define VM_DONTCOW	0x40000000	/* Contains no COW pages (copies on fork) */
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -789,7 +790,7 @@ int walk_page_range(unsigned long addr, 
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		struct vm_area_struct *dst_vma, struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2009-03-13 03:00:58.000000000 +1100
+++ linux-2.6/mm/memory.c	2009-03-13 03:07:52.000000000 +1100
@@ -580,7 +580,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		if (likely(!(vm_flags & VM_DONTCOW)))
+			ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 
@@ -594,6 +595,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
+		VM_BUG_ON(PageDontCOW(page) && !(vm_flags & VM_DONTCOW));
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
@@ -696,8 +698,10 @@ static inline int copy_pud_range(struct 
 	return 0;
 }
 
+static int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma);
+
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *dst_vma, struct vm_area_struct *vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
@@ -755,6 +759,15 @@ int copy_page_range(struct mm_struct *ds
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
 						  vma->vm_start, end);
+
+	WARN_ON(ret);
+	if (unlikely(vma->vm_flags & VM_DONTCOW) && !ret) {
+		if (decow_page_range(dst_mm, dst_vma))
+			ret = -ENOMEM;
+		/* child doesn't really need VM_DONTCOW after being de-COWed */
+		// dst_vma->vm_flags &= ~VM_DONTCOW;
+	}
+
 	return ret;
 }
 
@@ -1200,6 +1213,7 @@ static inline int use_zero_page(struct v
 }
 
 
+static int make_vma_nocow(struct vm_area_struct *vma);
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
@@ -1273,6 +1287,23 @@ int __get_user_pages(struct task_struct 
 		    (!ignore && !(vm_flags & vma->vm_flags)))
 			return i ? : -EFAULT;
 
+		if (!(flags & GUP_FLAGS_STACK) &&
+				is_cow_mapping(vma->vm_flags) &&
+				!(vma->vm_flags & VM_DONTCOW)) {
+			up_read(&mm->mmap_sem);
+			down_write(&mm->mmap_sem);
+			vma = find_vma(mm, start);
+			if (vma && is_cow_mapping(vma->vm_flags) &&
+				!(vma->vm_flags & VM_DONTCOW)) {
+				if (make_vma_nocow(vma)) {
+					downgrade_write(&mm->mmap_sem);
+					return i ? : -ENOMEM;
+				}
+			}
+			downgrade_write(&mm->mmap_sem);
+			continue;
+		}
+
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &len, i, write);
@@ -1910,6 +1941,8 @@ static int do_wp_page(struct mm_struct *
 		goto gotten;
 	}
 
+	VM_BUG_ON(PageDontCOW(old_page));
+
 	/*
 	 * Take out anonymous pages first, anonymous shared vmas are
 	 * not dirty accountable.
@@ -2102,6 +2135,232 @@ unwritable_page:
 	return VM_FAULT_SIGBUS;
 }
 
+static int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd,
+			spinlock_t *ptl, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pte_t pte = *ptep;
+	struct page *page, *new_page;
+	int ret = 0;
+
+	/* pte contains position in swap or file, so don't do anything */
+	if (unlikely(!pte_present(pte)))
+		return 0;
+	/* pte is writable, can't be COW */
+	if (pte_write(pte))
+		return 0;
+
+	page = vm_normal_page(vma, address, pte);
+	if (!page)
+		return 0;
+
+	if (!PageAnon(page))
+		return 0;
+
+	page_cache_get(page);
+
+	pte_unmap_unlock(pte, ptl);
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
+	VM_BUG_ON(page == ZERO_PAGE(0));
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		goto oom;
+	/*
+	 * Don't let another task, with possibly unlocked vma,
+	 * keep the mlocked page.
+	 */
+	if (vma->vm_flags & VM_LOCKED) {
+		lock_page(page);	/* for LRU manipulation */
+		clear_page_mlock(page);
+		unlock_page(page);
+	}
+	cow_user_page(new_page, page, address, vma);
+	__SetPageUptodate(new_page);
+	__SetPageDontCOW(new_page);
+
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
+		goto oom_free_new;
+
+	/*
+	 * Re-check the pte - we dropped the lock
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (likely(pte_same(*ptep, pte))) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(pte));
+		entry = mk_pte(new_page, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		/*
+		 * Clear the pte entry and flush it first, before updating the
+		 * pte with the new entry. This will avoid a race condition
+		 * seen in the presence of one thread doing SMC and another
+		 * thread doing COW.
+		 */
+		ptep_clear_flush_notify(vma, address, ptep);
+		page_add_new_anon_rmap(new_page, vma, address);
+		set_pte_at(mm, address, ptep, entry);
+
+		/* See comment in do_wp_page */
+		page_remove_rmap(page);
+	} else {
+		mem_cgroup_uncharge_page(new_page);
+		page_cache_release(new_page);
+		ret = -EAGAIN;
+	}
+
+	page_cache_release(page);
+
+	return ret;
+
+oom_free_new:
+	page_cache_release(new_page);
+oom:
+	page_cache_release(page);
+	return -ENOMEM;
+}
+
+static int decow_pte_range(struct mm_struct *mm,
+			pmd_t *pmd, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	int progress = 0;
+	int ret = 0;
+
+again:
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+//	arch_enter_lazy_mmu_mode();
+
+	do {
+		/*
+		 * We are holding two locks at this point - either of them
+		 * could generate latencies in another task on another CPU.
+		 */
+		if (progress >= 32) {
+			progress = 0;
+			if (need_resched() || spin_needbreak(ptl))
+				break;
+		}
+		if (pte_none(*pte)) {
+			progress++;
+			continue;
+		}
+		ret = decow_one_pte(mm, pte, pmd, ptl, vma, addr);
+		if (ret) {
+			if (ret == -EAGAIN) { /* retry */
+				ret = 0;
+				break;
+			}
+			goto out;
+		}
+		progress += 8;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+//	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+	if (addr != end)
+		goto again;
+out:
+	return ret;
+}
+
+static int decow_pmd_range(struct mm_struct *mm,
+			pud_t *pud, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		if (decow_pte_range(mm, pmd, vma, addr, next))
+			return -ENOMEM;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int decow_pud_range(struct mm_struct *mm,
+			pgd_t *pgd, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		if (decow_pmd_range(mm, pud, vma, addr, next))
+			return -ENOMEM;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static noinline int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+	int ret;
+
+	BUG_ON(!is_cow_mapping(vma->vm_flags));
+
+//	if (is_vm_hugetlb_page(vma))
+//		return decow_hugetlb_page_range(mm, vma);
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+
+	ret = 0;
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		if (unlikely(decow_pud_range(mm, pgd, vma, addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
+	} while (pgd++, addr = next, addr != end);
+
+	mmu_notifier_invalidate_range_end(mm, vma->vm_start, end);
+
+	return ret;
+}
+
+/*
+ * Turns the anonymous VMA into a "nocow" vma. De-cow existing COW pages.
+ * Must hold mmap_sem for write.
+ */
+static int make_vma_nocow(struct vm_area_struct *vma)
+{
+	static DEFINE_MUTEX(lock);
+	struct mm_struct *mm = vma->vm_mm;
+	int ret;
+
+	mutex_lock(&lock);
+	if (vma->vm_flags & VM_DONTCOW) {
+		mutex_unlock(&lock);
+		return 0;
+	}
+
+	ret = decow_page_range(mm, vma);
+	if (!ret)
+		vma->vm_flags |= VM_DONTCOW;
+	mutex_unlock(&lock);
+
+	return ret;
+}
+
 /*
  * Helper functions for unmap_mapping_range().
  *
@@ -2433,6 +2692,9 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}
 
+	if (unlikely(vma->vm_flags & VM_DONTCOW))
+		SetPageDontCOW(page);
+
 	mark_page_accessed(page);
 
 	lock_page(page);
@@ -2530,6 +2792,8 @@ static int do_anonymous_page(struct mm_s
 	if (!page)
 		goto oom;
 	__SetPageUptodate(page);
+	if (unlikely(vma->vm_flags & VM_DONTCOW))
+		__SetPageDontCOW(page);
 
 	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
@@ -2636,6 +2900,8 @@ static int __do_fault(struct mm_struct *
 				clear_page_mlock(vmf.page);
 			copy_user_highpage(page, vmf.page, address, vma);
 			__SetPageUptodate(page);
+			if (unlikely(vma->vm_flags & VM_DONTCOW))
+				__SetPageDontCOW(page);
 		} else {
 			/*
 			 * If the page will be shareable, see if the backing
@@ -2935,8 +3201,9 @@ int make_pages_present(unsigned long add
 	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
-	ret = get_user_pages(current, current->mm, addr,
-			len, write, 0, NULL, NULL);
+	ret = __get_user_pages(current, current->mm, addr,
+			len, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0),
+			NULL, NULL);
 	if (ret < 0)
 		return ret;
 	return ret == len ? 0 : -EFAULT;
@@ -3085,8 +3352,9 @@ int access_process_vm(struct task_struct
 		void *maddr;
 		struct page *page = NULL;
 
-		ret = get_user_pages(tsk, mm, addr, 1,
-				write, 1, &page, &vma);
+		ret = __get_user_pages(tsk, mm, addr, 1,
+				GUP_FLAGS_FORCE | GUP_FLAGS_STACK |
+				(write ? GUP_FLAGS_WRITE : 0), &page, &vma);
 		if (ret <= 0) {
 			/*
 			 * Check if this is a VM_IO | VM_PFNMAP VMA, which
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-13 03:00:58.000000000 +1100
+++ linux-2.6/arch/x86/mm/gup.c	2009-03-13 03:01:03.000000000 +1100
@@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t 
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+failed:
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (unlikely(!PageDontCOW(page)))
+			goto failed;
 		get_page(page);
 		pages[*nr] = page;
 		(*nr)++;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2009-03-13 03:00:58.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2009-03-13 03:01:03.000000000 +1100
@@ -94,6 +94,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
+	PG_dontcow,		/* PageAnon page in a VM_DONTCOW vma */
 #ifdef CONFIG_UNEVICTABLE_LRU
 	PG_unevictable,		/* Page is "unevictable"  */
 	PG_mlocked,		/* Page is vma mlocked */
@@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug)
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 __PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(DontCOW, dontcow)
+SETPAGEFLAG(DontCOW, dontcow)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-03-13 03:00:58.000000000 +1100
+++ linux-2.6/mm/page_alloc.c	2009-03-13 03:01:03.000000000 +1100
@@ -1000,6 +1000,7 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	__ClearPageDontCOW(page);
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-03-13 03:04:33.000000000 +1100
+++ linux-2.6/kernel/fork.c	2009-03-13 03:05:00.000000000 +1100
@@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c	2009-03-13 03:04:33.000000000 +1100
+++ linux-2.6/fs/exec.c	2009-03-13 03:05:00.000000000 +1100
@@ -165,6 +165,13 @@ exit:
 
 #ifdef CONFIG_MMU
 
+#define GUP_FLAGS_WRITE                  0x01
+#define GUP_FLAGS_STACK                  0x10
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
+
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		int write)
 {
@@ -178,8 +185,11 @@ static struct page *get_arg_page(struct 
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	down_read(&bprm->mm->mmap_sem);
+	ret = __get_user_pages(current, bprm->mm, pos,
+			1, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0),
+			&page, NULL);
+	up_read(&bprm->mm->mmap_sem);
 	if (ret <= 0)
 		return NULL;
 
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h	2009-03-13 03:04:33.000000000 +1100
+++ linux-2.6/mm/internal.h	2009-03-13 03:05:00.000000000 +1100
@@ -273,10 +273,11 @@ static inline void mminit_validate_memmo
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define GUP_FLAGS_WRITE                  0x1
-#define GUP_FLAGS_FORCE                  0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_WRITE                  0x01
+#define GUP_FLAGS_FORCE                  0x02
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04
+#define GUP_FLAGS_IGNORE_SIGKILL         0x08
+#define GUP_FLAGS_STACK                  0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 16:23             ` Nick Piggin
@ 2009-03-12 17:00               ` Andrea Arcangeli
  2009-03-12 17:20                 ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 17:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> OK, this is as far as I got tonight.
> 
> This passes Andrea's dma_thread test case. I haven't started hugepages,
> and it isn't quite right to drop the mmap_sem and retake it for write
> in get_user_pages (firstly, caller might hold mmap_sem for write,
> secondly, it may not be able to tolerate mmap_sem being dropped).

What's the point? I mean, this will simply work worse than my patch,
because it'll have to de-COW the whole range regardless of whether it's
pinned or not. Which will slow down fork in the O_DIRECT case even
more, for no good reason. I thought the complaint here was only a
beauty issue of not wanting to add a function called fork_pre_cow (or
your equivalent decow_one_pte) to the fork path, not any practical
issue with my patch, which has already passed all sorts of regression
testing and performance evaluation. Plus you still have a per-page
bitflag, and I think you have implementation issues in the patch (the
parent pte can't be left writable if you are in a dont-cow vma, or
the copy will not be atomic, and glibc will have no chance to fix its
bugs). You're not removing the fork_pre_cow logic from fork, so making
the logic less granular, at the vma level, only looks like a regression
to me.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:00               ` Andrea Arcangeli
@ 2009-03-12 17:20                 ` Nick Piggin
  2009-03-12 17:23                   ` Nick Piggin
  2009-03-12 18:06                   ` Andrea Arcangeli
  0 siblings, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 17:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 03:23:40AM +1100, Nick Piggin wrote:
> > OK, this is as far as I got tonight.
> >
> > This passes Andrea's dma_thread test case. I haven't started hugepages,
> > and it isn't quite right to drop the mmap_sem and retake it for write
> > in get_user_pages (firstly, caller might hold mmap_sem for write,
> > secondly, it may not be able to tolerate mmap_sem being dropped).
>
> What's the point?

Well, the main point is to avoid atomics and barriers and stuff like
that, especially in the fast gup path. It also seems very much smaller
(the vast majority of the change is the addition of the decow function).


> I mean this will simply work worse than my patch
> because it'll have to don't-cow the whole range regardless if it's
> pinned or not. Which will slowdown fork in the O_DIRECT case even
> more, for no good reason. 

Hmm, maybe. It can probably work entirely without the vm_flag,
however, and just use the page flag. Yes, I think it could, and that
might just avoid the whole problem of modifying vm_flags in gup. I'll
have to consider it more tomorrow.

But this is just for the case where we want to support this
transparently without being too intrusive. Apps that know and care
very much could use MADV_DONTFORK to avoid the copy completely.


> I thought the complaint here was only a
> beauty issue of not wanting to add a function called fork_pre_cow or
> your equivalent decow_one_pte in the fork path, not any practical
> issue with my patch which already passed all sort of regression
> testing and performance valuations.

My complaint is not decow / pre-cow (I think I suggested it as the
fix for the problem in the first place). I think the patch is quite
complex and is quite a slowdown for fast gup (especially with
hugepages). I'm just trying to explore a different approach.


> Plus you still have a per-page
> bitflag,

Sure. It's the atomic operations which I want to try to minimise.


> and I think you have implementation issues in the patch (the
> parent pte can't be left writeable if you are in a don't-cow vma, or
> the copy will not be atomic, and glibc will have no chance to fix its
> bugs)

Oh, we need to do that? OK, then just take out that statement, and
change VM_BUG_ON(PageDontCOW()) in do_wp_page to
VM_BUG_ON(PageDontCOW() && !reuse);

> . You're not removing the fork_pre_cow logic from fork, so I can
> only see it as a regression to make the logic less granular in the
> vma.

I'll see if it can be made per-page. But I still don't know if it
is a big problem. It's hard to know exactly what crazy things apps
require to be fast.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:20                 ` Nick Piggin
@ 2009-03-12 17:23                   ` Nick Piggin
  2009-03-12 18:06                   ` Andrea Arcangeli
  1 sibling, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-12 17:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 04:20:27 Nick Piggin wrote:
> On Friday 13 March 2009 04:00:11 Andrea Arcangeli wrote:

> > and I think you have implementation issues in the patch (the
> > parent pte can't be left writeable if you are in a don't-cow vma, or
> > the copy will not be atomic, and glibc will have no chance to fix its
> > bugs)
>
> Oh, we need to do that? OK, then just take out that statement, and

Should read: "take out that *if* statement" (the one which I put in to
avoid wrprotect in the parent)

> change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> VM_BUG_ON(PageDontCOW() && !reuse);


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 17:20                 ` Nick Piggin
  2009-03-12 17:23                   ` Nick Piggin
@ 2009-03-12 18:06                   ` Andrea Arcangeli
  2009-03-12 18:58                     ` Andrea Arcangeli
  2009-03-13 16:09                     ` Nick Piggin
  1 sibling, 2 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 18:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> Well the main point is to avoid atomics and barriers and stuff like
> that especially in the fast gup path. It also seems very much smaller
> (the vast majority of the change is the addition of decow function).

Well, if you remove the hugetlb part and you remove the passing of the
src/dst vma (which is needed anyway to fix PAT bugs), my patch will get
quite a bit smaller too.

Agree about the gup-fast path, but frankly I don't see how you avoid
having to change gup-fast... I wanted to ask about that...

> Hmm, maybe. It probably can possibly work entirely without the vm_flag
> and just use the page flag, however. Yes I think it could, and that

Right, I only use the page flag, and you seem to have a page flag
PG_dontcow too, after all.

> might just avoid the whole problem of modifying vm_flags in gup. I'll
> have to consider it more tomorrow.

Ok.

> But this case is just if we want to transparently support this without
> too much intrusive. Apps that know and care very much could use
> MADV_DONTFORK to avoid the copy completely.

Well those apps aren't the problem.

> My complaint is not decow / pre cow (I think I suggested it as the
> fix for the problem in the first place). I think the patch is quite

I'm sure that's not your complaint, right. I thought it was the primary
complaint in the discussion so far, though.

> complex and is quite a slowdown for fast gup (especially with
> hugepages). I'm just trying to explore different approach.

I think we could benchmark this. Also, once I understand how you avoid
touching the gup-fast fast path without sending a flood of IPIs in
fork, I'll understand better how your patch works.

> Oh, we need to do that? OK, then just take out that statement, and
> change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> VM_BUG_ON(PageDontCOW() && !reuse);

Not sure how do_wp_page is relevant; the problem I pointed out is in
fork_pre_cow/decow_pte only. If do_wp_page runs, it means the page
was already wrprotected in the parent, or it couldn't be shared; no
problem in do_wp_page in that respect.

The only thing required is that cow_user_page copies a page that
can't be modified by the parent thread pool during the copy. So
marking the parent pte wrprotected and flushing the tlb is required.
Then, after the copy, like in my fork_pre_cow, we set the parent pte
writable again. BTW, I'm starting to think I forgot a tlb flush after
setting the pte writable again; that could generate a minor fault that
we could avoid by flushing the tlb, right? But this is a minor thing,
and it'd only trigger if the parent only reads through the pte;
otherwise the parent thread will wait for fork on mmap_sem if it did a
write, or it won't have the tlb entry loaded in the first place if it
didn't touch the page while the pte was temporarily wrprotected.
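
(The sequence being described is, schematically, the following; a
sketch of the ordering only, with locking, error paths and the real
fork_pre_cow details elided, and all parameters assumed set up by the
caller as in copy_one_pte():)

static void pre_cow_sketch(struct mm_struct *src_mm, struct mm_struct *dst_mm,
			   struct vm_area_struct *vma, unsigned long addr,
			   pte_t *src_pte, pte_t *dst_pte, pte_t orig_pte,
			   struct page *old_page, struct page *new_page)
{
	/* 1. stop the parent's other threads from writing during the copy */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	flush_tlb_page(vma, addr);	/* no stale writable TLB entries */

	/* 2. the copy is now atomic w.r.t. the parent thread pool */
	copy_user_highpage(new_page, old_page, addr, vma);

	/* 3. the child gets the copy; the parent keeps the pinned page
	 * and its pte is made writable again (the extra tlb flush
	 * discussed above would go here) */
	set_pte_at(dst_mm, addr, dst_pte,
		   mk_pte(new_page, vma->vm_page_prot));
	set_pte_at(src_mm, addr, src_pte, pte_mkwrite(orig_pte));
}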

> I'll see if it can be made per-page. But I still don't know if it
> is a big problem. It's hard to know exactly what crazy things apps
> require to be fast.

The thing is quite simple: if an app has a 1G vma loaded, you'll
allocate 1G of ram for no good reason. It can even OOM; it's not just
a performance issue. Doing it per-page like I do won't be noticeable,
as the in-flight I/O will be minor.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 18:06                   ` Andrea Arcangeli
@ 2009-03-12 18:58                     ` Andrea Arcangeli
  2009-03-13 16:09                     ` Nick Piggin
  1 sibling, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-12 18:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Thu, Mar 12, 2009 at 07:06:48PM +0100, Andrea Arcangeli wrote:
> again. BTW, I start to think I forgot a tlb flush after setting the
> pte writable again, that could generate a minor fault that we can
> avoid by flushing the tlb, right? But this is a minor thing, and it'd

Ah no, that is already taken care of by the fork flush in the parent
before returning, so no problem (and it would have been a minor thing
anyway).


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-12 18:06                   ` Andrea Arcangeli
  2009-03-12 18:58                     ` Andrea Arcangeli
@ 2009-03-13 16:09                     ` Nick Piggin
  2009-03-13 19:34                       ` Andrea Arcangeli
  2009-03-14  4:46                       ` Nick Piggin
  1 sibling, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-13 16:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:
> On Fri, Mar 13, 2009 at 04:20:27AM +1100, Nick Piggin wrote:
> > Well the main point is to avoid atomics and barriers and stuff like
> > that especially in the fast gup path. It also seems very much smaller
> > (the vast majority of the change is the addition of decow function).
>
> Well if you remove the hugetlb part and you remove the pass of src/dst
> vma that is needed anyway to fix PAT bugs, my patch will get quite
> smaller too.

Possibly true. OK, it wasn't a very good argument to compare my
incomplete RFC patch based on size alone :)


> Agree about the gup-fast path, but frankly I miss how you avoid having
> to change gup-fast... I wanted to asked about that...

It is more straightforward than your version because it does not try to
make the page re-COW-able again after the GUP is finished. The main
conceptual difference between our fixes, I think (ignoring my silly
vma-wide decow), is this issue.

Of course I could have a race in fast-gup, but I can't see one. I'm
working on removing the vma stuff and just making it per-page, which
might make it easier to review.


> > Oh, we need to do that? OK, then just take out that statement, and
> > change VM_BUG_ON(PageDontCOW()) in do_wp_page to
> > VM_BUG_ON(PageDontCOW() && !reuse);
>
> Not sure how do_wp_page is relevant, the problem I pointed out is in
> the fork_pre_cow/decow_pte only. If do_wp_page runs it means the page
> was already wrprotected in the parent or it couldn't be shared, no
> problem in do_wp_page in that respect.

Well, it would save having to touch the parent's pagetables after
doing the atomic copy-on-fork in the child. Just have the parent do
a do_wp_page, which will notice it is the only user of the page and
reuse it rather than COW it (now that Hugh has fixed the races in
the reuse check, that should be fine).
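
(Schematically, the reuse path in question looks like the following,
trimmed from do_wp_page() of that era and not verbatim:)

	if (PageAnon(old_page)) {
		lock_page(old_page);
		reuse = reuse_swap_page(old_page);	/* sole owner? */
		unlock_page(old_page);
	}
	if (reuse) {
		/* nothing to copy: just restore write access in place */
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		entry = pte_mkyoung(orig_pte);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		ptep_set_access_flags(vma, address, page_table, entry, 1);
	}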


> The only thing required is that cow_user_page is copying a page that
> can't be modified by the parent thread pool during the copy. So
> marking parent pte wrprotected and flushing tlb is required. Then
> after the copy like in my fork_pre_cow we set the parent pte writable
> again.

Yes you could do it this way too, I'm not sure which way is better...
I'll have to take another look at it after removing the per-vma code
from mine.

> > I'll see if it can be made per-page. But I still don't know if it
> > is a big problem. It's hard to know exactly what crazy things apps
> > require to be fast.
>
> The thing is quite simple, if an app has a 1G of vma loaded, you'll
> allocate 1G of ram for no good reason. It can even OOM, it's not just
> a performance issue. While doing it per-page like I do, won't be
> noticeable, as the in-flight I/O will be minor.

Yes I agree now it is a silly way to do it.

Now I also see that your patch still hasn't covered the other side of
the race, whereas my scheme should. Hmm, I think that if we want to
go to the extent of adding all this code and telling userspace apps
they can use zerocopy IO and not care about COW, then we really must
cover both sides of the race; otherwise it is just asking for data
corruption.

Conversely, if we leave *any* holes open by design, then we may as well
leave *all* holes open and have simpler code -- because apps will have
to know about the zerocopy vs COW problem anyway. Don't you agree?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-13 16:09                     ` Nick Piggin
@ 2009-03-13 19:34                       ` Andrea Arcangeli
  2009-03-14  4:59                         ` Nick Piggin
  2009-03-14  4:46                       ` Nick Piggin
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-13 19:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> Of course I could have a race in fast-gup, but I don't think I can see
> one. I'm working on removing the vma stuff and just making it per-page,
> which might make it easier to review.

If you didn't touch gup-fast and you don't send IPIs in fork, you most
certainly have one; it's the one Linus pointed out and that I've fixed
(with Izik; then I sorted out the ordering details and how to make it
safe on the fork side).
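
(Schematically, the interlock being referred to looks like the
following; the names and the pin-detection test are illustrative, not
the actual patch. The real fix also leans on gup-fast running under
local_irq_disable(), which holds off the fork-side TLB flush IPIs:)

/* gup-fast side, per pte, IRQs disabled: */
	get_page(page);			/* A: take the pin first */
	smp_mb();			/* pin visible before the re-check */
	if (!pte_write(*ptep)) {	/* B: fork may have wrprotected it */
		put_page(page);		/* bail to slow gup under mmap_sem */
	}

/* fork side, per pte, PT lock held: */
	ptep_set_wrprotect(src_mm, addr, src_pte);	/* pairs with B */
	smp_mb();		/* wrprotect visible before the count read */
	if (page_count(page) != page_mapcount(page)) {
		/* pairs with A: an in-flight pin may exist against this
		 * page, so give the child a copy now (fork_pre_cow) */
	}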

> Well, it would save having to touch the parent's pagetables after
> doing the atomic copy-on-fork in the child. Just have the parent do
> a do_wp_page, which will notice it is the only user of the page and
> reuse it rather than COW it (now that Hugh has fixed the races in
> the reuse check that should be fine).

If we're in the trouble path, it means the parent already owns the
page. I just leave it owned by the parent; the pte remains the same
before and after fork. No point in changing the pte value if we're in
the troublesome path, as far as I can tell. I only verify that the
parent pte didn't go away from under fork when I temporarily release
the parent PT lock to allocate the cow page in the slow path (see the
-EAGAIN path; I also verified it triggers with swapping and the system
survives fine ;).

> Now I also see that your patch still hasn't covered the other side of
> the race, wheras my scheme should do. Hmm, I think that if we want to

Sorry, but can you elaborate again on what the other side of the race is?

If the child gets a whole new page, and the parent keeps its own page
with the pte marked read-write the whole time that a page fault can run
(a page fault takes mmap_sem; all we have to protect against when
temporarily releasing the parent PT lock is the VM rmap code, and that
is taken care of by the pte_same path), then I don't see any other side
of the race...

> go to the extent of adding all this code in and tell userspace apps
> they can use zerocopy IO and not care about COW, then we really must
> cover both sides of the race otherwise it is just asking for data
> corruption.

Surely I agree that if there's another side of the race left uncovered
by my patch, we have to address it too, if we make any change and we
don't consider this a 'feature'!

> Conversely, if we leave *any* holes open by design, then we may as well
> leave *all* holes open and have simpler code -- because apps will have
> to know about the zerocopy vs COW problem anyway. Don't you agree?

Indeed ;)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-13 16:09                     ` Nick Piggin
  2009-03-13 19:34                       ` Andrea Arcangeli
@ 2009-03-14  4:46                       ` Nick Piggin
  2009-03-14  5:06                         ` Nick Piggin
  1 sibling, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-14  4:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 03:09:39 Nick Piggin wrote:
> On Friday 13 March 2009 05:06:48 Andrea Arcangeli wrote:

> > The thing is quite simple, if an app has a 1G of vma loaded, you'll
> > allocate 1G of ram for no good reason. It can even OOM, it's not just
> > a performance issue. While doing it per-page like I do, won't be
> > noticeable, as the in-flight I/O will be minor.
>
> Yes I agree now it is a silly way to do it.

Here is an updated patch that just does it on a per-page basis.
Actually it is still a bit sloppy, because I just reused some code
from my last patch for the decow logic... possibly I can just use
the same precow code that you use for small and huge pages (although
Linus didn't like it so much... it is very hard to do nicely right
down there in the call chain :()

Anyway, ignoring the decow implementation (that's not really the
interesting part of the patch), I think this is looking pretty good
now.
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2009-03-14 15:12:13.000000000 +1100
@@ -789,7 +789,7 @@ int walk_page_range(unsigned long addr, 
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		struct vm_area_struct *dst_vma, struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/mm/memory.c	2009-03-14 15:40:37.000000000 +1100
@@ -533,12 +533,248 @@ out:
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+{
+	/*
+	 * If the source page was a PFN mapping, we don't have
+	 * a "struct page" for it. We do a best-effort copy by
+	 * just copying from the original user address. If that
+	 * fails, we just zero-fill it. Live with it.
+	 */
+	if (unlikely(!src)) {
+		void *kaddr = kmap_atomic(dst, KM_USER0);
+		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+
+		/*
+		 * This really shouldn't fail, because the page is there
+		 * in the page tables. But it might just be unreadable,
+		 * in which case we just give up and fill the result with
+		 * zeroes.
+		 */
+		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+			memset(kaddr, 0, PAGE_SIZE);
+		kunmap_atomic(kaddr, KM_USER0);
+		flush_dcache_page(dst);
+	} else
+		copy_user_highpage(dst, src, va, vma);
+}
+
+static int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd,
+			spinlock_t *ptl, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pte_t pte = *ptep;
+	struct page *page, *new_page;
+
+	/* pte contains position in swap or file, so don't do anything */
+	if (unlikely(!pte_present(pte)))
+		return 0;
+	/* pte is writable, can't be COW */
+	if (pte_write(pte))
+		return 0;
+
+	page = vm_normal_page(vma, address, pte);
+	if (!page)
+		return 0;
+
+	if (!PageAnon(page))
+		return 0;
+
+	WARN_ON(!PageDontCOW(page));
+
+	page_cache_get(page);
+
+	pte_unmap_unlock(pte, ptl);
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
+	VM_BUG_ON(page == ZERO_PAGE(0));
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		goto oom;
+	/*
+	 * Don't let another task, with possibly unlocked vma,
+	 * keep the mlocked page.
+	 */
+	if (vma->vm_flags & VM_LOCKED) {
+		lock_page(page);	/* for LRU manipulation */
+		clear_page_mlock(page);
+		unlock_page(page);
+	}
+	cow_user_page(new_page, page, address, vma);
+	__SetPageUptodate(new_page);
+
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
+		goto oom_free_new;
+
+	/*
+	 * Re-check the pte - we dropped the lock
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	BUG_ON(!pte_same(*ptep, pte));
+	{
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(pte));
+		entry = mk_pte(new_page, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		/*
+		 * Clear the pte entry and flush it first, before updating the
+		 * pte with the new entry. This will avoid a race condition
+		 * seen in the presence of one thread doing SMC and another
+		 * thread doing COW.
+		 */
+		ptep_clear_flush_notify(vma, address, ptep);
+		page_add_new_anon_rmap(new_page, vma, address);
+		set_pte_at(mm, address, ptep, entry);
+
+		/* See comment in do_wp_page */
+		page_remove_rmap(page);
+	}
+
+	page_cache_release(page);
+
+	return 0;
+
+oom_free_new:
+	page_cache_release(new_page);
+oom:
+	page_cache_release(page);
+	return -ENOMEM;
+}
+
+static int decow_pte_range(struct mm_struct *mm,
+			pmd_t *pmd, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	int progress = 0;
+	int ret = 0;
+
+again:
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+//	arch_enter_lazy_mmu_mode();
+
+	do {
+		/*
+		 * We are holding two locks at this point - either of them
+		 * could generate latencies in another task on another CPU.
+		 */
+		if (progress >= 32) {
+			progress = 0;
+			if (need_resched() || spin_needbreak(ptl))
+				break;
+		}
+		if (pte_none(*pte)) {
+			progress++;
+			continue;
+		}
+		ret = decow_one_pte(mm, pte, pmd, ptl, vma, addr);
+		if (ret) {
+			if (ret == -EAGAIN) { /* retry */
+				ret = 0;
+				break;
+			}
+			goto out;
+		}
+		progress += 8;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+//	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+	if (addr != end)
+		goto again;
+out:
+	return ret;
+}
+
+static int decow_pmd_range(struct mm_struct *mm,
+			pud_t *pud, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none_or_clear_bad(pmd))
+			continue;
+		if (decow_pte_range(mm, pmd, vma, addr, next))
+			return -ENOMEM;
+	} while (pmd++, addr = next, addr != end);
+	return 0;
+}
+
+static int decow_pud_range(struct mm_struct *mm,
+			pgd_t *pgd, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(pgd, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		if (pud_none_or_clear_bad(pud))
+			continue;
+		if (decow_pmd_range(mm, pud, vma, addr, next))
+			return -ENOMEM;
+	} while (pud++, addr = next, addr != end);
+	return 0;
+}
+
+static noinline int decow_page_range(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	int ret;
+
+	BUG_ON(!is_cow_mapping(vma->vm_flags));
+
+//	if (is_vm_hugetlb_page(vma))
+//		return decow_hugetlb_page_range(mm, vma);
+
+//	mmu_notifier_invalidate_range_start(mm, addr, end);
+
+	ret = 0;
+	pgd = pgd_offset(mm, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (pgd_none_or_clear_bad(pgd))
+			continue;
+		if (unlikely(decow_pud_range(mm, pgd, vma, addr, next))) {
+			ret = -ENOMEM;
+			break;
+		}
+	} while (pgd++, addr = next, addr != end);
+
+//	mmu_notifier_invalidate_range_end(mm, vma->vm_start, end);
+
+	return ret;
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -546,6 +782,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int ret = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -597,20 +834,26 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
+		if (unlikely(PageDontCOW(page)))
+			ret = -EAGAIN;
 	}
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return ret;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int ret = 0;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -637,7 +880,10 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		ret = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+						src_vma, addr, rss);
+		if (unlikely(ret))
+			goto decow;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -650,10 +896,25 @@ again:
 	if (addr != end)
 		goto again;
 	return 0;
+
+decow:
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(src_ptl);
+	pte_unmap_nested(src_pte);
+	add_mm_rss(dst_mm, rss[0], rss[1]);
+	pte_unmap_unlock(dst_pte, dst_ptl);
+	cond_resched();
+	if (decow_page_range(dst_mm, dst_vma, addr, addr + PAGE_SIZE))
+		return -ENOMEM;
+	addr += PAGE_SIZE;
+	if (addr != end)
+		goto again;
+	return 0;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -668,14 +929,15 @@ static inline int copy_pmd_range(struct 
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pgd_t *dst_pgd, pgd_t *src_pgd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -690,19 +952,19 @@ static inline int copy_pud_range(struct 
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
 	int ret;
 
 	/*
@@ -711,20 +973,20 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
-		if (!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+		if (!src_vma->anon_vma)
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
 
-	if (unlikely(is_pfn_mapping(vma))) {
+	if (unlikely(is_pfn_mapping(src_vma))) {
 		/*
 		 * We do not free on error cases below as remove_vma
 		 * gets called on error from higher level routine
 		 */
-		ret = track_pfn_vma_copy(vma);
+		ret = track_pfn_vma_copy(src_vma);
 		if (ret)
 			return ret;
 	}
@@ -735,7 +997,7 @@ int copy_page_range(struct mm_struct *ds
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
 	ret = 0;
@@ -746,15 +1008,16 @@ int copy_page_range(struct mm_struct *ds
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, src_vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+						  src_vma->vm_start, end);
+
 	return ret;
 }
 
@@ -1200,7 +1463,6 @@ static inline int use_zero_page(struct v
 }
 
 
-
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -1225,6 +1487,7 @@ int __get_user_pages(struct task_struct 
 	do {
 		struct vm_area_struct *vma;
 		unsigned int foll_flags;
+		int decow;
 
 		vma = find_extend_vma(mm, start);
 		if (!vma && in_gate_area(tsk, start)) {
@@ -1279,12 +1542,15 @@ int __get_user_pages(struct task_struct 
 			continue;
 		}
 
+		decow = (!(flags & GUP_FLAGS_STACK) &&
+					is_cow_mapping(vma->vm_flags));
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
 		if (!write && use_zero_page(vma))
 			foll_flags |= FOLL_ANON;
 
+
 		do {
 			struct page *page;
 
@@ -1299,7 +1565,7 @@ int __get_user_pages(struct task_struct 
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
@@ -1342,6 +1608,7 @@ int __get_user_pages(struct task_struct 
 			if (pages) {
 				pages[i] = page;
 
+				SetPageDontCOW(page);
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
 			}
@@ -1829,45 +2096,6 @@ static inline int pte_unmap_same(struct 
 }
 
 /*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
-{
-	/*
-	 * If the source page was a PFN mapping, we don't have
-	 * a "struct page" for it. We do a best-effort copy by
-	 * just copying from the original user address. If that
-	 * fails, we just zero-fill it. Live with it.
-	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst, KM_USER0);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
-
-		/*
-		 * This really shouldn't fail, because the page is there
-		 * in the page tables. But it might just be unreadable,
-		 * in which case we just give up and fill the result with
-		 * zeroes.
-		 */
-		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
-			memset(kaddr, 0, PAGE_SIZE);
-		kunmap_atomic(kaddr, KM_USER0);
-		flush_dcache_page(dst);
-	} else
-		copy_user_highpage(dst, src, va, vma);
-}
-
-/*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
  * and decrementing the shared-page counter for the old page.
@@ -1930,6 +2158,8 @@ static int do_wp_page(struct mm_struct *
 		}
 		reuse = reuse_swap_page(old_page);
 		unlock_page(old_page);
+		VM_BUG_ON(PageDontCOW(old_page) && !reuse);
+
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		/*
@@ -2935,8 +3165,9 @@ int make_pages_present(unsigned long add
 	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
-	ret = get_user_pages(current, current->mm, addr,
-			len, write, 0, NULL, NULL);
+	ret = __get_user_pages(current, current->mm, addr,
+			len, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0),
+			NULL, NULL);
 	if (ret < 0)
 		return ret;
 	return ret == len ? 0 : -EFAULT;
@@ -3085,8 +3316,9 @@ int access_process_vm(struct task_struct
 		void *maddr;
 		struct page *page = NULL;
 
-		ret = get_user_pages(tsk, mm, addr, 1,
-				write, 1, &page, &vma);
+		ret = __get_user_pages(tsk, mm, addr, 1,
+				GUP_FLAGS_FORCE | GUP_FLAGS_STACK |
+				(write ? GUP_FLAGS_WRITE : 0), &page, &vma);
 		if (ret <= 0) {
 			/*
 			 * Check if this is a VM_IO | VM_PFNMAP VMA, which
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/arch/x86/mm/gup.c	2009-03-14 02:48:12.000000000 +1100
@@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t 
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+failed:
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (unlikely(!PageDontCOW(page)))
+			goto failed;
 		get_page(page);
 		pages[*nr] = page;
 		(*nr)++;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2009-03-14 02:48:13.000000000 +1100
@@ -94,6 +94,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
+	PG_dontcow,		/* Dont COW PageAnon page */
 #ifdef CONFIG_UNEVICTABLE_LRU
 	PG_unevictable,		/* Page is "unevictable"  */
 	PG_mlocked,		/* Page is vma mlocked */
@@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug)
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 __PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(DontCOW, dontcow)
+SETPAGEFLAG(DontCOW, dontcow)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2009-03-13 20:25:02.000000000 +1100
+++ linux-2.6/mm/page_alloc.c	2009-03-14 02:48:13.000000000 +1100
@@ -1000,6 +1000,7 @@ static void free_hot_cold_page(struct pa
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
+	__ClearPageDontCOW(page);
 	if (PageAnon(page))
 		page->mapping = NULL;
 	if (free_pages_check(page))
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/kernel/fork.c	2009-03-14 15:12:09.000000000 +1100
@@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Index: linux-2.6/fs/exec.c
===================================================================
--- linux-2.6.orig/fs/exec.c	2009-03-13 20:25:00.000000000 +1100
+++ linux-2.6/fs/exec.c	2009-03-14 02:48:14.000000000 +1100
@@ -165,6 +165,13 @@ exit:
 
 #ifdef CONFIG_MMU
 
+#define GUP_FLAGS_WRITE                  0x01
+#define GUP_FLAGS_STACK                  0x10
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
+
 static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
 		int write)
 {
@@ -178,8 +185,11 @@ static struct page *get_arg_page(struct 
 			return NULL;
 	}
 #endif
-	ret = get_user_pages(current, bprm->mm, pos,
-			1, write, 1, &page, NULL);
+	down_read(&bprm->mm->mmap_sem);
+	ret = __get_user_pages(current, bprm->mm, pos,
+			1, GUP_FLAGS_STACK | (write ? GUP_FLAGS_WRITE : 0),
+			&page, NULL);
+	up_read(&bprm->mm->mmap_sem);
 	if (ret <= 0)
 		return NULL;
 
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h	2009-03-13 20:25:00.000000000 +1100
+++ linux-2.6/mm/internal.h	2009-03-14 02:48:14.000000000 +1100
@@ -273,10 +273,11 @@ static inline void mminit_validate_memmo
 }
 #endif /* CONFIG_SPARSEMEM */
 
-#define GUP_FLAGS_WRITE                  0x1
-#define GUP_FLAGS_FORCE                  0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL         0x8
+#define GUP_FLAGS_WRITE                  0x01
+#define GUP_FLAGS_FORCE                  0x02
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x04
+#define GUP_FLAGS_IGNORE_SIGKILL         0x08
+#define GUP_FLAGS_STACK                  0x10
 
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
		     struct page **pages, struct vm_area_struct **vmas);

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-13 19:34                       ` Andrea Arcangeli
@ 2009-03-14  4:59                         ` Nick Piggin
  2009-03-16 13:56                           ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-14  4:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 06:34:16 Andrea Arcangeli wrote:
> On Sat, Mar 14, 2009 at 03:09:39AM +1100, Nick Piggin wrote:
> > Of course I could have a race in fast-gup, but I don't think I can see
> > one. I'm working on removing the vma stuff and just making it per-page,
> > which might make it easier to review.
>
> If you didn't touch gup-fast and you don't send ipis in fork, you most
> certainly have one. It's the one Linus pointed out and that I've fixed
> (with Izik; then I sorted out the ordering details and how to make it
> safe on the fork side).

It does touch gup-fast, but it just adds one branch and no barrier in the
case where the page is de-cowed (and I think it would still work with
hugepages via get_page_multiple, although I haven't done the hugepage
implementation yet).


> > Well, it would save having to touch the parent's pagetables after
> > doing the atomic copy-on-fork in the child. Just have the parent do
> > a do_wp_page, which will notice it is the only user of the page and
> > reuse it rather than COW it (now that Hugh has fixed the races in
> > the reuse check that should be fine).
>
> If we're in the trouble path, it means the parent already owns the
> page. I just leave it owned by the parent; the pte remains the same
> before and after fork. No point in changing the pte value if we're in
> the troublesome path as far as I can tell. I only verify that the
> parent pte didn't go away from under fork when I temporarily release
> the parent PT lock to allocate the cow page in the slow path (see the
> -EAGAIN path; I also verified it triggers with swapping and the system
> survives fine ;).

Possibly that's the right way to go. It depends on whether this is at all
performance critical. If not, I would just let do_wp_page do the work
to avoid a little bit of logic, but either way is not a big deal to me.


> > Now I also see that your patch still hasn't covered the other side of
> > the race, whereas my scheme should do. Hmm, I think that if we want to
>
> Sorry, but can you elaborate again what the other side of the race is?
>
> If the child gets a whole new page, and the parent keeps its own page
> with the pte marked read-write the whole time that a page fault can run
> (a page fault takes mmap_sem; all we have to protect against when
> temporarily releasing the parent PT lock is the VM rmap code, and that
> is taken care of by the pte_same path), then I don't see any other side
> of the race...

Oh sorry. I was up too late last night :)

One side of the race is a direct IO read writing into the fork child's
page. The other side of the race is a write to the fork child's page
leaking into the direct IO.

My patch solves both sides by de-cowing *any* COW page before it
may be returned from get_user_pages (for read or write).
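
(A minimal sketch of that de-cow rule, condensed from the patch rather
than the literal code: before pinning an anon page, gup takes the
write-fault path even for read-only callers, so do_wp_page breaks any
COW sharing and the now-exclusive page gets marked PageDontCOW:)

	/* inside the get_user_pages() loop, before taking the pin */
	if (PageAnon(page) && !PageDontCOW(page))
		foll_flags |= FOLL_WRITE;	/* force the COW break */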

The following test case shows up the corruption both with the standard
kernel and with your patch, but can't trigger it with my patch. You must
create a "file.dat" file of FILESIZE bytes by hand in the cwd. You may
have to tweak timings to get the following order of output:

thread writing
parent storing
child storing
thread writing done

Afterwards hexdump file.dat, and if any 0xff bytes have leaked into
it, then it is from the child writing to the child buffer affecting the
parent's direct IO write.


--- reverse-race.c ---

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (4*1024*1024) 
#define BUFSIZE  (1024*1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *filename = "file.dat";
static int fd;
static void *buffer;
#define PAGE_SIZE   4096

static void store(void)
{
	int i;

	if (usleep(50*1000) == -1)
		perror("usleep"), exit(1);

	printf("child storing\n"); fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0xff;

	_exit(0);
}

static void *writer(void *arg)
{
	int i;

	if (pthread_mutex_lock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	printf("thread writing\n"); fflush(stdout);
	for (i = 0; i < FILESIZE / BUFSIZE; i++) {
		size_t count = BUFSIZE;
		ssize_t ret;

		do {
			ret = write(fd, buffer, count);
			if (ret == -1) {
				if (errno != EINTR)
					perror("write"), exit(1);
				ret = 0;
			}
			count -= ret;
		} while (count);
	}
	printf("thread writing done\n"); fflush(stdout);

	if (pthread_mutex_unlock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	int status;
	pthread_t writer_thread;
	pid_t store_proc;

	posix_memalign(&buffer, PAGE_SIZE, BUFSIZE);
	printf("Write buffer: %p.\n", buffer);

	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x00;

	fd = open(filename, O_RDWR|O_DIRECT);
	if (fd == -1)
		perror("open"), exit(1);

	if (pthread_mutex_lock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	if (pthread_create(&writer_thread, NULL, writer, NULL) == -1)
		perror("pthred_create"), exit(1);

	store_proc = fork();
	if (store_proc == -1)
		perror("fork"), exit(1);
	if (!store_proc)
		store();

	if (pthread_mutex_unlock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	if (usleep(10*1000) == -1)
		perror("usleep"), exit(1);

	printf("parent storing\n"); fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x11;

	do {
		pid_t w;
		w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED);
		if (w == -1)
			perror("waitpid"), exit(1);
	} while (!WIFEXITED(status) && !WIFSIGNALED(status));

	if (pthread_join(writer_thread, NULL) == -1)
		perror("pthread_join"), exit(1);

	exit(0);
}

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-14  4:46                       ` Nick Piggin
@ 2009-03-14  5:06                         ` Nick Piggin
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-14  5:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 15:46:30 Nick Piggin wrote:

> Index: linux-2.6/arch/x86/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-14 02:48:06.000000000 +1100
> +++ linux-2.6/arch/x86/mm/gup.c	2009-03-14 02:48:12.000000000 +1100
> @@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t
>  		struct page *page;
>
>  		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
> +failed:
>  			pte_unmap(ptep);
>  			return 0;
>  		}
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		if (unlikely(!PageDontCOW(page)))
> +			goto failed;
>  		get_page(page);
>  		pages[*nr] = page;
>  		(*nr)++;

Ah, that's stupid; the test should be confined just to PageAnon
&& !PageDontCOW pages, of course.
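
Roughly, as an untested sketch of the corrected hunk:

 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		/*
+		 * Only anon pages can still be COW-shared here; file
+		 * and special pages are safe to pin without the flag.
+		 */
+		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
+			goto failed;
 		get_page(page);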



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:19               ` Linus Torvalds
  2009-03-11 20:33                 ` Linus Torvalds
  2009-03-11 20:48                 ` Andrea Arcangeli
@ 2009-03-14  5:06                 ` Benjamin Herrenschmidt
  2009-03-14  5:20                   ` Nick Piggin
  2 siblings, 1 reply; 83+ messages in thread
From: Benjamin Herrenschmidt @ 2009-03-14  5:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> 
> That said, I don't know who the crazy O_DIRECT users are. It may be true 
> that some O_DIRECT users end up using the same pages over and over again, 
> and that this is a good optimization for them.

Just my 2 cents here...

While I agree mostly with what you say about O_DIRECT craziness,
unfortunately gup is also a fashionable interface in a few other areas,
such as IB or RDMA'ish things, and I'm pretty sure we'll see others
popping up here and there.

Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
able to take a chunk of existing user addresses and not care whether
they are anonymous, shmem, file mappings, large pages, ... and just gup
them and get some DMA pounding on them. There are various usage scenarios
where it's in fact less ugly than anything else you can come up with,
pretty much.

IB folks so far have been avoiding the fork() trap thanks to
madvise(MADV_DONTFORK) afaik (see the sketch below). And it all goes
generally well when the whole application knows what it's doing and just
plain avoids fork.
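
For reference, a minimal userspace sketch of that workaround (the buffer
size and the I/O that would use it are made up for illustration):

/* exclude a DMA buffer from fork() so a child never COW-shares it */
#define _GNU_SOURCE 1

#include <stdlib.h>
#include <sys/mman.h>

#define BUFSIZE (1024*1024)

int main(void)
{
	void *buf;

	if (posix_memalign(&buf, 4096, BUFSIZE))
		return 1;
	/*
	 * After this, fork() leaves the range unmapped in the child,
	 * so COW can never move the physical pages out from under a
	 * long-lived gup pin (RDMA registration, O_DIRECT, ...).
	 */
	if (madvise(buf, BUFSIZE, MADV_DONTFORK))
		return 1;
	/* ... register buf with the driver and start the DMA ... */
	return 0;
}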

-But- things get nasty if, for some reason, the user of gup is somewhere
deep in some kind of library that an application uses without knowing,
while forking here or there to run shell scripts or other helpers.

I've seen it :-)

So if a solution can be found that doesn't uglify the whole thing beyond
recognition, it's probably worth it.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-11 20:33                 ` Linus Torvalds
  2009-03-11 20:55                   ` Andrea Arcangeli
@ 2009-03-14  5:07                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 83+ messages in thread
From: Benjamin Herrenschmidt @ 2009-03-14  5:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Wed, 2009-03-11 at 13:33 -0700, Linus Torvalds wrote:
>  - Just make the rule be that people who use get_user_pages() always 
>    have to have the read-lock on mmap_sem until they've used the
> pages.
> 

That's not going to work with IB and friends who gup() whole bunches of
user memory forever...

Ben.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-14  5:06                 ` Benjamin Herrenschmidt
@ 2009-03-14  5:20                   ` Nick Piggin
  2009-03-16 16:01                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-14  5:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Andrea Arcangeli, Ingo Molnar, Nick Piggin,
	Hugh Dickins, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Saturday 14 March 2009 16:06:29 Benjamin Herrenschmidt wrote:
> On Wed, 2009-03-11 at 13:19 -0700, Linus Torvalds wrote:
> > That said, I don't know who the crazy O_DIRECT users are. It may be true
> > that some O_DIRECT users end up using the same pages over and over again,
> > and that this is a good optimization for them.
>
> Just my 2 cents here...
>
> While I agree mostly with what you say about O_DIRECT craziness,
> unfortunately gup is also a fashionable interface in a few other areas,
> such as IB or RDMA'ish things, and I'm pretty sure we'll see others
> popping up here and there.
>
> Right, it's a bit stinky, but it -is- somewhat nice for a driver to be
> able to take a chunk of existing user addresses and not care whether
> they are anonymous, shmem, file mappings, large pages, ... and just gup
> them and get some DMA pounding on them. There are various usage scenarios
> where it's in fact less ugly than anything else you can come up with,
> pretty much.
>
> IB folks so far have been avoiding the fork() trap thanks to
> madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> whole application knows what it's doing and just plain avoids fork.
>
> -But- things get nasty if, for some reason, the user of gup is somewhere
> deep in some kind of library that an application uses without knowing,
> while forking here or there to run shell scripts or other helpers.
>
> I've seen it :-)
>
> So if a solution can be found that doesn't uglify the whole thing beyond
> recognition, it's probably worth it.

AFAICS, the approach I've posted is probably the simplest (and maybe the
only) way to really fix it. It's not too ugly.

You can't easily fix it at write time by COWing in the right direction as
Linus suggested, because at that point you may have multiple get_user_pages
pins (for read) from the parent and child on the page, so there is no way
to COW it in the right direction.

You could do something crazy like allowing only one get_user_pages read on a
wp page, and recording which direction to send it if it does get COWed. But
at that point you've got something that's far uglier in the core code and
more complex than what I posted.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-14  4:59                         ` Nick Piggin
@ 2009-03-16 13:56                           ` Andrea Arcangeli
  2009-03-16 16:01                             ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-16 13:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> It does touch gup-fast, but it just adds one branch and no barrier in the

My question is what trick you use to stop gup-fast from returning the
page mapped read-write by the pte, if gup-fast doesn't take any lock
whatsoever, doesn't set any bit in any page or vma, and doesn't recheck
that the pte is still viable after having set any bit on pages or vmas,
while you still don't send a flood of ipis from the fork fast path (the
no-race case).

> case where the page is de-cowed (and I think it would still work with
> hugepages via get_page_multiple, although I haven't done the hugepage
> implementation yet).

Yes, let's ignore hugetlb for now. I fixed hugetlb too, but that can be
left for later.

> Possibly that's the right way to go. It depends on whether this is at all
> performance critical. If not, I would just let do_wp_page do the work
> to avoid a little bit of logic, but either way is not a big deal to me.

fork is less performance critical than do_wp_page; still, in a fork
microbenchmark no slowdown is measured with the patch. Before I
introduced PG_gup there were false positives triggered by the pagevec
temporary pins, and that was measurable; after PG_gup the fast path is
unaffected (I've still to measure the gup-fast slowdown from setting
PG_gup, but I'm rather optimistic that you're underestimating the cost
of walking 4 layers of pagetables compared to a locked op on an l1
exclusive cacheline, so I think it'll be lost in the noise). I think
the big win of gup-fast is primarily in not having to search vmas, and
in turn not taking any shared lock like mmap_sem/PT lock, scaling at
the page level with just the get-page being the troublesome cacheline.

> One side of the race is a direct IO read writing into the fork child's
> page. The other side of the race is a write to the fork child's page
> leaking into the direct IO.
> 
> My patch solves both sides by de-cowing *any* COW page before it
> may be returned from get_user_pages (for read or write).

I see what you mean now. If you read the comment in my patch you'll
see I explicitly assumed that only people writing into memory with
gup were troublesome here. As you point out, using gup for _reading_
from memory is troublesome as well if the child writes to those
pages. This is a lesser problem, because the major issue is that fork
alone is enough to generate memory corruption even if the child isn't
touching those pages. The reverse race requires the child to write to
those pages, so I guess it never triggered in real-life apps. But
nevertheless I totally agree that if we fix write-to-memory-with-gup
we have to fix read-from-memory-with-gup too.

Below I updated my patch and its commit header to fix the reverse race
too. However I had to enlarge the buffer to 40M to reproduce with your
testcase, because my HD was too fast otherwise.

----------
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512 bytes of a page,
while another thread reads and writes to/from the first 512 bytes of
that page. We can lose O_DIRECT reads (or any other get_user_pages
write=1 I/O, not just bio/O_DIRECT) the very moment we mark any pte
wrprotected, because a third unrelated thread forks off a child.
This fixes it by copying the anon page (instead of sharing it) within
fork, if there can be any direct I/O in flight to the page. That takes
care of O_DIRECT reads (writes to memory, reads from disk). Checking
the page_count under the PT lock guarantees no get_user_pages could be
running under us, because if somebody wants to write to the page, it
has to break any cow first, and that requires taking the PT lock in
follow_page before increasing the page count. We are also guaranteed
mapcount is 1 if fork is writeprotecting the pte, so the PT lock is
enough to serialize against get_user_pages->get_page.
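
(A condensed sketch of that fork-side test, not the literal patch code,
which also trylocks the page: run under the parent PT lock with the pte
already wrprotected, any page reference beyond the mappings and the
swapcache means a gup pin may be in flight.)

	if (PageAnon(page) && PageGUP(page) &&
	    page_count(page) != page_mapcount(page) + !!PageSwapCache(page))
		forcecow = 1;	/* copy into the child instead of sharing */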

Another problem is O_DIRECT writes to disk: if the parent touches a
shared anon page before the child, the child's do_wp_page will take
over the anon page and map it read-write despite it being under
direct-io from the parent thread pool. This requires de-cowing the
pages in gup more aggressively (i.e. setting FOLL_WRITE temporarily on
anon pages to de-cow them, and always assuming write=1 in the hugetlb
follow_page version).

gup-fast is taken care of without flushing the smp-tlb for every
wrprotected parent pte, by wrprotecting the pte before checking the
page count vs mapcount. gup-fast will then re-check that the pte is
still available in write mode after having increased the page count,
thus solving the race without a flood of IPIs in fork.
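
(The gup-fast side of that ordering, condensed from the x86 hunk below:
pin and flag first, then re-read the pte; the smp_mb() pairs with the
barrier fork issues after wrprotecting.)

	get_page(page);
	if (PageAnon(page) && !PageGUP(page))
		SetPageGUP(page);
	smp_mb();
	if (!pte_write(gup_get_pte(ptep))) {
		/* fork wrprotected the pte first: drop the pin and
		 * fall back to the slow gup path */
		put_page(page);
		return 0;
	}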

The COW triggered inside fork will run while the parent pte is
read-only, to provide as usual the per-page atomic copy from parent to
child during fork. However, timings will be altered by having to copy
the pages that might be under O_DIRECT.

Once this race is fixed, the testcase, instead of showing corruption,
is capable of triggering a glibc NPTL race condition where fork_pre_cow
is copying the internal nptl stack list in anonymous memory while some
parent thread may be modifying it, which results in a userland deadlock
when the fork child tries to free the stacks before returning from
fork. We are flushing the tlb after wrprotecting the pte that maps the
anon page if we take the fork_pre_cow path, so we should be providing a
per-page atomic copy from parent to child. The race can indeed also
trigger without this patch and without fork_pre_cow: to trigger it, the
wrprotect event must happen exactly in the middle of a
list_add/list_del instruction run by some NPTL thread that is mangling
the stack list while fork runs. A preliminary NPTL fix for this race,
exposed by this fix, is happening in the glibc repository, but I think
it'd be better off with a smart lock capable of jumping in and out of a
signal handler rather than going out of order rcu-style, which sounds
too complex.

The pagevec code calls get_page while the page is sitting in the
pagevec (before it becomes PageLRU), and doing so can generate false
positives. So, to avoid slowing down fork all the time even for pages
that could never possibly be under O_DIRECT write=1, the PG_gup bitflag
is added; this eliminates most of the overhead of the fix in fork.

I had to add src_vma/dst_vma to use the proper ->mm pointers, and in the
case of the track_pfn_vma_copy PAT code this fixes a bug, because
previously vma was the dst_vma, while track_pfn_vma_copy has to run on
the src_vma (the dst_vma in that place is guaranteed to have zero ptes
instantiated/allocated).

There are two testcases that reproduce the bug, both with regular anon
pages and, using libhugetlbfs, with hugepages too. The patch works for
both. The glibc race is also eventually reproducible, both using anon
pages and hugepages, with the dma_thread testcase (the forkscrew
testcase isn't capable of reproducing the nptl race condition in fork).

========== dma_thread.c =======
/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024) 
#define READSIZE  (1024*1024)

#define FILENAME    "test_%.04d.tmp"
#define FILECOUNT   100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE   4096

#define true	1
#define false	0

typedef int bool;

bool	done	= false;
int	workers = 2;

#define PATTERN (0xfa)

static void
usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [ -w <workers>]\n"
		    "\nWith no arguments, generate test files and exit.\n"
		    "-h Display this help and exit.\n"
		    "-a align read buffer to offset <alignment>.\n"
		    "-w number of worker threads, 2 (default) to 256,\n"
		    "   defaults to number of cores.\n\n"

		    "Run first with no arguments to generate files.\n"
		    "Then run with -a <alignment> = 512  or 0. \n");
}

typedef struct {
    pthread_t	    tid;
    int		    worker_number;
    int		    fd;
    int		    offset;
    int		    length;
    int		    pattern;
    unsigned char  *buffer;
} worker_t;


void *worker_thread(void * arg)
{
    int		    bytes_read;
    int		    i,k;
    worker_t	   *worker  = (worker_t *) arg;
    int		    offset  = worker->offset;
    int		    fd	    = worker->fd;
    unsigned char  *buffer  = worker->buffer;
    int		    pattern = worker->pattern;
    int		    length  = worker->length;
    
    if (lseek(fd, offset, SEEK_SET) < 0) {
	fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n", 
			offset, fd, strerror(errno));
	exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
	fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n", 
			fd, bytes_read, strerror(errno));
	exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
	if (buffer[i] != pattern) {
	    printf("Bad data at 0x%.06x: %p, \n", i, buffer + i);
	    printf("Data dump starting at 0x%.06x:\n", i - 8);
	    printf("Expect 0x%x followed by 0x%x:\n",
		    pattern, PATTERN);

	    for (k = 0; k < 16; k++) {
		printf("%02x ", buffer[i - 8 + k]);
		if (k == 7) {
		    printf("\n");
		}       
	    }

	    printf("\n");
	    abort();
	}
    }

    return 0;
}

void *fork_thread (void *arg) 
{
    pid_t pid;

    while (!done) {
	pid = fork();
	if (pid == 0) {
	    exit(0);
	} else if (pid < 0) {
	    fprintf(stderr, "Failed to fork child.\n");
	    exit(1);
	} 
	waitpid(pid, NULL, 0 );
	usleep(100);
    }

    return NULL;

}

int main(int argc, char *argv[])
{
    unsigned char  *buffer = NULL;
    char	    filename[1024];
    int		    fd;
    bool	    dowrite = true;
    pthread_t	    fork_tid;
    int		    c, n, j;
    worker_t	   *worker;
    int		    align = 0;
    int		    offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
	switch (c) {
	case 'a':
	    align = atoi(optarg);
	    if (align < 0 || align > PAGE_SIZE) {
		printf("Bad alignment %d.\n", align);
		exit(1);
	    }
	    dowrite = false;
	    break;

	case 'h':
	    usage();
	    exit(0);
	    break;

	case 'w':
	    workers = atoi(optarg);
	    if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
		fprintf(stderr, "Worker count %d not between "
				"%d and %d, inclusive.\n",
				workers, MIN_WORKERS, MAX_WORKERS);
		usage();
		exit(1);
	    }
	    dowrite = false;
	    break;

	default:
	    usage();
	    exit(1);
	}
    }

    if (argc > 1 && (optind < argc)) {
	fprintf(stderr, "Bad command line.\n");
	usage();
	exit(1);
    }

    if (dowrite) {

	buffer = malloc(FILESIZE);
	if (buffer == NULL) {
	    fprintf(stderr, "Failed to malloc write buffer.\n");
	    exit(1);
	}

	for (n = 1; n <= FILECOUNT; n++) {
	    sprintf(filename, FILENAME, n);
	    fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
	    if (fd < 0) {
		printf("create failed(%s): %s.\n", filename, strerror(errno));
		exit(1);
	    }
	    memset(buffer, n, FILESIZE);
	    printf("Writing file %s.\n", filename);
	    if (write(fd, buffer, FILESIZE) != FILESIZE) {
		printf("write failed (%s)\n", filename);
	    }

	    close(fd);
	    fd = -1;
	}

	free(buffer);
	buffer = NULL;

	printf("done\n");
	exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
	fprintf(stderr, "Failed to malloc worker array.\n");
	exit(1);
    }

    for (j = 0; j < workers; j++) {
	worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);
    
    posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    printf("Read buffer: %p.\n", buffer);
    for (n = 1; n <= FILECOUNT; n++) {

	sprintf(filename, FILENAME, n);
	for (j = 0; j < workers; j++) {
	    if ((worker[j].fd = open(filename,  O_RDONLY|O_DIRECT)) < 0) {
		fprintf(stderr, "Failed to open %s: %s.\n",
				filename, strerror(errno));
		exit(1);
	    }

	    worker[j].pattern = n;
	}

	printf("Reading file %d.\n", n);

	for (offset = 0; offset < FILESIZE; offset += READSIZE) {
	    memset(buffer, PATTERN, READSIZE + align);
	    for (j = 0; j < workers; j++) {
		worker[j].offset = offset + j * PAGE_SIZE;
		worker[j].buffer = buffer + align + j * PAGE_SIZE;
		worker[j].length = PAGE_SIZE;
	    }
	    /* The final worker reads whatever is left over. */
	    worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

	    done = 0;

	    rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Can't create fork thread: %s.\n", 
				strerror(rc));
		exit(1);
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_create(&worker[j].tid, 
				    NULL, 
				    worker_thread, 
				    worker + j);
		if (rc != 0) {
		    fprintf(stderr, "Can't create worker thread %d: %s.\n", 
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    for (j = 0; j < workers; j++) {
		rc = pthread_join(worker[j].tid, NULL);
		if (rc != 0) {
		    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
				    j, strerror(rc));
		    exit(1);
		}
	    }

	    /* Let the fork thread know it's ok to exit */
	    done = 1;

	    rc = pthread_join(fork_tid, NULL);
	    if (rc != 0) {
		fprintf(stderr, "Failed to join fork thread: %s.\n",
				strerror(rc));
		exit(1);
	    }
	}

	/* Close the fd's for the next file. */
	for (j = 0; j < workers; j++) {
	    close(worker[j].fd);
	}
    }

    return 0;
}
========== dma_thread.c =======

========== forkscrew.c ========
/*
 * Copyright 2009, Red Hat, Inc.
 *
 * Author: Jeff Moyer <jmoyer@redhat.com>
 *
 * This program attempts to expose a race between O_DIRECT I/O and the fork()
 * path in a multi-threaded program.  In order to reliably reproduce the
 * problem, it is best to perform a dd from the device under test to /dev/null
 * as this makes the read I/O slow enough to orchestrate the problem.
 *
 * Running:  ./forkscrew
 *
 * It is expected that a file named "data" exists in the current working
 * directory, and that its contents are something other than 0x2a.  A simple
 * dd if=/dev/zero of=data bs=1M count=1 should be sufficient.
 */
#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <pthread.h>
#include <libaio.h>

pthread_cond_t worker_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t worker_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t fork_cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_t fork_mutex = PTHREAD_MUTEX_INITIALIZER;

char *buffer;
int fd;

/* pattern filled into the in-memory buffer */
#define PATTERN		0x2a  // '*'

void
usage(void)
{
	fprintf(stderr,
		"\nUsage: forkscrew\n"
		"it is expected that a file named \"data\" is the current\n"
		"working directory.  It should be at least 3*pagesize in size\n"
		);
}

void
dump_buffer(char *buf, int len)
{
	int i;
	int last_off, last_val;

	last_off = -1;
	last_val = -1;

	for (i = 0; i < len; i++) {
		if (last_off < 0) {
			last_off = i;
			last_val = buf[i];
			continue;
		}

		if (buf[i] != last_val) {
			printf("%d - %d: %d\n", last_off, i - 1, last_val);
			last_off = i;
			last_val = buf[i];
		}
	}

	if (last_off != len - 1)
		printf("%d - %d: %d\n", last_off, i-1, last_val);
}

int
check_buffer(char *bufp, int len, int pattern)
{
	int i;

	for (i = 0; i < len; i++) {
		if (bufp[i] == pattern)
			return 1;
	}
	return 0;
}

void *
forker_thread(void *arg)
{
	pthread_mutex_lock(&fork_mutex);
	pthread_cond_signal(&fork_cond);
	pthread_cond_wait(&fork_cond, &fork_mutex);
	switch (fork()) {
	case 0:
		sleep(1);
		printf("child dumping buffer:\n");
		dump_buffer(buffer + 512, 2*getpagesize());
		exit(0);
	case -1:
		perror("fork");
		exit(1);
	default:
		break;
	}
	pthread_cond_signal(&fork_cond);
	pthread_mutex_unlock(&fork_mutex);

	wait(NULL);
	return (void *)0;
}

void *
worker(void *arg)
{
	int first = (int)arg;
	char *bufp;
	int pagesize = getpagesize();
	int ret;
	int corrupted = 0;

	if (first) {
		io_context_t aioctx;
		struct io_event event;
		struct iocb *iocb = malloc(sizeof *iocb);
		if (!iocb) {
			perror("malloc");
			exit(1);
		}
		memset(&aioctx, 0, sizeof(aioctx));
		ret = io_setup(1, &aioctx);
		if (ret != 0) {
			errno = -ret;
			perror("io_setup");
			exit(1);
		}
		bufp = buffer + 512;
		io_prep_pread(iocb, fd, bufp, pagesize, 0);

		/* submit the I/O */
		io_submit(aioctx, 1, &iocb);

		/* tell the fork thread to run */
		pthread_mutex_lock(&fork_mutex);
		pthread_cond_signal(&fork_cond);

		/* wait for the fork to happen */
		pthread_cond_wait(&fork_cond, &fork_mutex);
		pthread_mutex_unlock(&fork_mutex);

		/* release the other worker to issue I/O */
		pthread_mutex_lock(&worker_mutex);
		pthread_cond_signal(&worker_cond);
		pthread_mutex_unlock(&worker_mutex);

		ret = io_getevents(aioctx, 1, 1, &event, NULL);
		if (ret != 1) {
			errno = -ret;
			perror("io_getevents");
			exit(1);
		}
		if (event.res != pagesize) {
			errno = -event.res;
			perror("read error");
			exit(1);
		}

		io_destroy(aioctx);

		/* check buffer, should be corrupt */
		if (check_buffer(bufp, pagesize, PATTERN)) {
			printf("worker 0 failed check\n");
			dump_buffer(bufp, pagesize);
			corrupted = 1;
		}

	} else {

		bufp = buffer + 512 + pagesize;

		pthread_mutex_lock(&worker_mutex);
		pthread_cond_signal(&worker_cond); /* tell main we're ready */
		/* wait for the first I/O and the fork */
		pthread_cond_wait(&worker_cond, &worker_mutex);
		pthread_mutex_unlock(&worker_mutex);

		/* submit overlapping I/O */
		ret = read(fd, bufp, pagesize);
		if (ret != pagesize) {
			perror("read");
			exit(1);
		}
		/* check buffer, should be fine */
		if (check_buffer(bufp, pagesize, PATTERN)) {
			printf("worker 1 failed check -- abnormal\n");
			dump_buffer(bufp, pagesize);
			corrupted = 1;
		}
	}

	return (void *)corrupted;
}

int
main(int argc, char **argv)
{
	pthread_t workers[2];
	pthread_t forker;
	int ret, rc = 0;
	void *thread_ret;
	int pagesize = getpagesize();

	fd = open("data", O_DIRECT|O_RDONLY);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	ret = posix_memalign((void **)&buffer, pagesize, 3 * pagesize);
	if (ret != 0) {
		errno = ret;
		perror("posix_memalign");
		exit(1);
	}
	memset(buffer, PATTERN, 3*pagesize);

	pthread_mutex_lock(&fork_mutex);
	ret = pthread_create(&forker, NULL, forker_thread, NULL);
	pthread_cond_wait(&fork_cond, &fork_mutex);
	pthread_mutex_unlock(&fork_mutex);

	pthread_mutex_lock(&worker_mutex);
	ret |= pthread_create(&workers[0], NULL, worker, (void *)0);
	if (ret) {
		perror("pthread_create");
		exit(1);
	}
	pthread_cond_wait(&worker_cond, &worker_mutex);
	pthread_mutex_unlock(&worker_mutex);

	ret = pthread_create(&workers[1], NULL, worker, (void *)1);
	if (ret != 0) {
		perror("pthread_create");
		exit(1);
	}

	pthread_join(forker, NULL);
	pthread_join(workers[0], &thread_ret);
	if (thread_ret != 0)
		rc = 1;
	pthread_join(workers[1], &thread_ret);
	if (thread_ret != 0)
		rc = 1;

	if (rc != 0) {
		printf("parent dumping full buffer\n");
		dump_buffer(buffer + 512, 2 * pagesize);
	}

	close(fd);
	free(buffer);
	exit(rc);
}
========== forkscrew.c ========

========== forkscrewreverse.c ========
#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <memory.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (40*1024*1024) 
#define BUFSIZE  (40*1024*1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *filename = "file.dat";
static int fd;
static void *buffer;
#define PAGE_SIZE   4096

static void store(void)
{
	int i;

	if (usleep(50*1000) == -1)
		perror("usleep"), exit(1);

	printf("child storing\n"); fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0xff;

	_exit(0);
}

static void *writer(void *arg)
{
	int i;

	if (pthread_mutex_lock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	printf("thread writing\n"); fflush(stdout);
	for (i = 0; i < FILESIZE / BUFSIZE; i++) {
		size_t count = BUFSIZE;
		ssize_t ret;

		do {
			ret = write(fd, buffer, count);
			if (ret == -1) {
				if (errno != EINTR)
					perror("write"), exit(1);
				ret = 0;
			}
			count -= ret;
		} while (count);
	}
	printf("thread writing done\n"); fflush(stdout);

	if (pthread_mutex_unlock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	int status;
	pthread_t writer_thread;
	pid_t store_proc;

	posix_memalign(&buffer, PAGE_SIZE, BUFSIZE);
	printf("Write buffer: %p.\n", buffer);

	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x00;

	fd = open(filename, O_RDWR|O_DIRECT);
	if (fd == -1)
		perror("open"), exit(1);

	if (pthread_mutex_lock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	if (pthread_create(&writer_thread, NULL, writer, NULL) == -1)
		perror("pthred_create"), exit(1);

	store_proc = fork();
	if (store_proc == -1)
		perror("fork"), exit(1);
	if (!store_proc)
		store();

	if (pthread_mutex_unlock(&lock) == -1)
		perror("pthread_mutex_lock"), exit(1);

	if (usleep(10*1000) == -1)
		perror("usleep"), exit(1);

	printf("parent storing\n"); fflush(stdout);
	for (i = 0; i < BUFSIZE; i++)
		((char *)buffer)[i] = 0x11;

	do {
		pid_t w;
		w = waitpid(store_proc, &status, WUNTRACED | WCONTINUED);
		if (w == -1)
			perror("waitpid"), exit(1);
	} while (!WIFEXITED(status) && !WIFSIGNALED(status));

	if (pthread_join(writer_thread, NULL) == -1)
		perror("pthread_join"), exit(1);

	exit(0);
}
========== forkscrewreverse.c ========

Normally I test with "dma_thread -a 512 -w 40".

To reproduce or verify the fix with hugepages run it like this:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ../test/dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrew
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrewreverse

This is a fixed version of the original patch from Nick Piggin.

KSM has the same problem as fork, and it also checks the page_count
after a ptep_clear_flush_notify (the _flush sends an smp tlb flush,
which stops gup-fast, so it doesn't depend on the above gup-fast
changes that allow fork not to flush the smp-tlb for every wrprotected
pte; the _notify ensures all secondary ptes are zapped and any page pin
released for mmu-notifier subsystems that take page pins, like KVM
currently does).

BTW, I guess it's pure luck that ENOSPC != VM_FAULT_OOM in hugetlb.c;
mixing -errno with -VM_FAULT_* is total breakage that will have to be
cleaned up (either don't use -ENOSPC, or use -ENOMEM instead of
VM_FAULT_OOM). I didn't address it in this patch as it's unrelated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

Removed mtk.manpages@gmail.com, linux-man@vger.kernel.org from
previous CC list.

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 		get_page(page);
+		if (PageAnon(page)) {
+			if (!PageGUP(page))
+				SetPageGUP(page);
+			smp_mb();
+			/*
+			 * Fork doesn't want to flush the smp-tlb for
+			 * every pte that it marks readonly but newly
+			 * created shared anon pages cannot have
+			 * direct-io going to them, so check if fork
+			 * made the page shared before we took the
+			 * page pin.
+			 * De-cow to make direct reads from memory safe.
+			 */
+			if ((pte_flags(gup_get_pte(ptep)) &
+			     (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) {
+				put_page(page);
+				pte_unmap(ptep);
+				return 0;
+			}
+		}
 		pages[*nr] = page;
 		(*nr)++;
 
@@ -98,24 +118,16 @@ static noinline int gup_pte_range(pmd_t 
 	return 1;
 }
 
-static inline void get_head_page_multiple(struct page *page, int nr)
-{
-	VM_BUG_ON(page != compound_head(page));
-	VM_BUG_ON(page_count(page) == 0);
-	atomic_add(nr, &page->_count);
-}
-
-static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
-		unsigned long end, int write, struct page **pages, int *nr)
+static noinline int gup_huge_pmd(pmd_t *pmdp, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
 {
 	unsigned long mask;
-	pte_t pte = *(pte_t *)&pmd;
+	pte_t pte = *(pte_t *)pmdp;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	/* de-cow to make direct read from memory safe */
+	mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -127,12 +139,21 @@ static noinline int gup_huge_pmd(pmd_t p
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
+		get_page(head);
+		if (!PageGUP(head))
+			SetPageGUP(head);
+		smp_mb();
+		if ((pte_flags(*(pte_t *)pmdp) & mask) != mask) {
+			put_page(head);
+			return 0;
+		}
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
-	get_head_page_multiple(head, refs);
+	VM_BUG_ON(page_count(head) == 0);
+	VM_BUG_ON(head != compound_head(head));
 
 	return 1;
 }
@@ -151,7 +172,7 @@ static int gup_pmd_range(pud_t pud, unsi
 		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
-			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
+			if (!gup_huge_pmd(pmdp, addr, next, pages, nr))
 				return 0;
 		} else {
 			if (!gup_pte_range(pmd, addr, next, write, pages, nr))
@@ -162,17 +183,16 @@ static int gup_pmd_range(pud_t pud, unsi
 	return 1;
 }
 
-static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
-		unsigned long end, int write, struct page **pages, int *nr)
+static noinline int gup_huge_pud(pud_t *pudp, unsigned long addr,
+		unsigned long end, struct page **pages, int *nr)
 {
 	unsigned long mask;
-	pte_t pte = *(pte_t *)&pud;
+	pte_t pte = *(pte_t *)pudp;
 	struct page *head, *page;
 	int refs;
 
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	/* de-cow to make direct read from memory safe */
+	mask = _PAGE_PRESENT|_PAGE_USER|_PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -184,12 +204,21 @@ static noinline int gup_huge_pud(pud_t p
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	do {
 		VM_BUG_ON(compound_head(page) != head);
+		get_page(head);
+		if (!PageGUP(head))
+			SetPageGUP(head);
+		smp_mb();
+		if ((pte_flags(*(pte_t *)pudp) & mask) != mask) {
+			put_page(head);
+			return 0;
+		}
 		pages[*nr] = page;
 		(*nr)++;
 		page++;
 		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
-	get_head_page_multiple(head, refs);
+	VM_BUG_ON(page_count(head) == 0);
+	VM_BUG_ON(head != compound_head(head));
 
 	return 1;
 }
@@ -208,7 +237,7 @@ static int gup_pud_range(pgd_t pgd, unsi
 		if (pud_none(pud))
 			return 0;
 		if (unlikely(pud_large(pud))) {
-			if (!gup_huge_pud(pud, addr, next, write, pages, nr))
+			if (!gup_huge_pud(pudp, addr, next, pages, nr))
 				return 0;
 		} else {
 			if (!gup_pmd_range(pud, addr, next, write, pages, nr))
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -20,8 +20,8 @@ int hugetlb_sysctl_handler(struct ctl_ta
 int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
-int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
-int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int, int);
+int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *, struct vm_area_struct *);
+int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
 void unmap_hugepage_range(struct vm_area_struct *,
 			unsigned long, unsigned long, struct page *);
 void __unmap_hugepage_range(struct vm_area_struct *,
@@ -75,9 +75,9 @@ static inline unsigned long hugetlb_tota
 	return 0;
 }
 
-#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
+#define follow_hugetlb_page(m,v,p,vs,a,b,i)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
-#define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
+#define copy_hugetlb_page_range(src, dst, dst_vma, src_vma)	({ BUG(); 0; })
 #define hugetlb_prefault(mapping, vma)		({ BUG(); 0; })
 #define unmap_hugepage_range(vma, start, end, page)	BUG()
 static inline void hugetlb_report_meminfo(struct seq_file *m)
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -789,7 +789,8 @@ void free_pgd_range(struct mmu_gather *t
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
@@ -1238,7 +1239,7 @@ int vm_insert_mixed(struct vm_area_struc
 			unsigned long pfn);
 
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
-			unsigned int foll_flags);
+			unsigned int *foll_flags);
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,7 @@ enum pageflags {
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+	PG_gup,
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -195,6 +196,7 @@ PAGEFLAG(Private, private) __CLEARPAGEFL
 PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private)
 	__SETPAGEFLAG(Private, private)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+PAGEFLAG(GUP, gup) __CLEARPAGEFLAG(GUP, gup)
 
 __PAGEFLAG(SlobPage, slob_page)
 __PAGEFLAG(SlobFree, slob_free)
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1695,20 +1695,37 @@ static void set_huge_ptep_writable(struc
 	}
 }
 
+/* Return the pagecache page at a given address within a VMA */
+static struct page *hugetlbfs_pagecache_page(struct hstate *h,
+			struct vm_area_struct *vma, unsigned long address)
+{
+	struct address_space *mapping;
+	pgoff_t idx;
+
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, address);
+
+	return find_lock_page(mapping, idx);
+}
+
+static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
+		       unsigned long address, pte_t *ptep, pte_t pte,
+		       struct page *pagecache_page);
 
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
+			    struct vm_area_struct *dst_vma,
+			    struct vm_area_struct *src_vma)
 {
-	pte_t *src_pte, *dst_pte, entry;
+	pte_t *src_pte, *dst_pte, entry, orig_entry;
 	struct page *ptepage;
 	unsigned long addr;
-	int cow;
-	struct hstate *h = hstate_vma(vma);
+	int cow, forcecow, oom;
+	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 
-	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
+	cow = (src_vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+	for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
 		src_pte = huge_pte_offset(src, addr);
 		if (!src_pte)
 			continue;
@@ -1720,22 +1737,76 @@ int copy_hugetlb_page_range(struct mm_st
 		if (dst_pte == src_pte)
 			continue;
 
+		oom = 0;
 		spin_lock(&dst->page_table_lock);
 		spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
-		if (!huge_pte_none(huge_ptep_get(src_pte))) {
-			if (cow)
-				huge_ptep_set_wrprotect(src, addr, src_pte);
-			entry = huge_ptep_get(src_pte);
+		orig_entry = entry = huge_ptep_get(src_pte);
+		forcecow = 0;
+		if (!huge_pte_none(entry)) {
 			ptepage = pte_page(entry);
 			get_page(ptepage);
+			if (cow && pte_write(entry)) {
+				huge_ptep_set_wrprotect(src, addr, src_pte);
+				smp_mb();
+				if (PageGUP(ptepage))
+					forcecow = 1;
+				entry = huge_ptep_get(src_pte);
+			}
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(&src->page_table_lock);
+		if (forcecow) {
+			if (unlikely(vma_needs_reservation(h, dst_vma, addr)
+				     < 0))
+				oom = 1;
+			else {
+				struct page *pg;
+				int cow_ret;
+				spin_unlock(&dst->page_table_lock);
+				/* force atomic copy from parent to child */
+				flush_tlb_range(src_vma, addr, addr+sz);
+				/*
+				 * Can use hstate from src_vma and src_vma
+				 * because the hugetlbfs pagecache will
+				 * be the same for both src_vma and dst_vma.
+				 */
+				pg = hugetlbfs_pagecache_page(h,
+							      src_vma,
+							      addr);
+				spin_lock_nested(&dst->page_table_lock,
+						 SINGLE_DEPTH_NESTING);
+				cow_ret = hugetlb_cow(dst, dst_vma, addr,
+						      dst_pte, entry,
+						      pg);
+				/*
+				 * We hold mmap_sem in write mode and
+				 * the VM doesn't know about hugepages
+				 * so the src_pte/dst_pte can't change
+				 * from under us even without both
+				 * page_table_lock hold the whole time.
+				 */
+				BUG_ON(!pte_same(huge_ptep_get(src_pte),
+						 entry));
+				set_huge_pte_at(src, addr,
+						src_pte,
+						orig_entry);
+				if (cow_ret)
+					oom = 1;
+			}
+		}
 		spin_unlock(&dst->page_table_lock);
+		if (oom)
+			goto nomem;
 	}
 	return 0;
 
 nomem:
+	/*
+	 * Want this to also be able to return -ENOSPC? Then stop the
+	 * mess of mixing -VM_FAULT_ and -ENOSPC retvals and be
+	 * consistent returning -ENOMEM instead of -VM_FAULT_OOM in
+	 * alloc_huge_page.
+	 */
 	return -ENOMEM;
 }
 
@@ -1943,19 +2014,6 @@ retry_avoidcopy:
 	return 0;
 }
 
-/* Return the pagecache page at a given address within a VMA */
-static struct page *hugetlbfs_pagecache_page(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address)
-{
-	struct address_space *mapping;
-	pgoff_t idx;
-
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
-
-	return find_lock_page(mapping, idx);
-}
-
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, int write_access)
 {
@@ -2160,8 +2218,7 @@ static int huge_zeropage_ok(pte_t *ptep,
 
 int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i,
-			int write)
+			unsigned long *position, int *length, int i)
 {
 	unsigned long pfn_offset;
 	unsigned long vaddr = *position;
@@ -2181,16 +2238,16 @@ int follow_hugetlb_page(struct mm_struct
 		 * first, for the page indexing below to work.
 		 */
 		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
-		if (huge_zeropage_ok(pte, write, shared))
+		if (huge_zeropage_ok(pte, 1, shared))
 			zeropage_ok = 1;
 
 		if (!pte ||
 		    (huge_pte_none(huge_ptep_get(pte)) && !zeropage_ok) ||
-		    (write && !pte_write(huge_ptep_get(pte)))) {
+		    !pte_write(huge_ptep_get(pte))) {
 			int ret;
 
 			spin_unlock(&mm->page_table_lock);
-			ret = hugetlb_fault(mm, vma, vaddr, write);
+			ret = hugetlb_fault(mm, vma, vaddr, 1);
 			spin_lock(&mm->page_table_lock);
 			if (!(ret & VM_FAULT_ERROR))
 				continue;
@@ -2207,8 +2264,11 @@ same_page:
 		if (pages) {
 			if (zeropage_ok)
 				pages[i] = ZERO_PAGE(0);
-			else
+			else {
 				pages[i] = mem_map_offset(page, pfn_offset);
+				if (!PageGUP(page))
+					SetPageGUP(page);
+			}
 			get_page(pages[i]);
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -538,14 +538,16 @@ out:
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
+		pte_t *dst_pte, pte_t *src_pte,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, int *rss)
 {
-	unsigned long vm_flags = vma->vm_flags;
+	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -576,15 +578,6 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	}
 
 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -592,27 +585,87 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child.
+	 */
+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
+		/*
+		 * Serialization against gup-fast happens by
+		 * wrprotecting the pte and checking the PG_gup flag
+		 * and the number of page pins after that. If gup-fast
+		 * boosts the page_count after we checked it, it will
+		 * also take the slow path because it will find the
+		 * pte wrprotected.
+		 */
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+	}
+
+	page = vm_normal_page(src_vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page, vma, addr);
+		page_dup_rmap(page, dst_vma, addr);
+		if (is_cow_mapping(vm_flags) && pte_write(pte) &&
+		    PageAnon(page)) {
+			smp_mb();
+			if (PageGUP(page)) {
+				if (unlikely(!trylock_page(page)))
+					forcecow = 1;
+				else {
+					BUG_ON(page_mapcount(page) != 2);
+					if (unlikely(page_count(page) !=
+						     page_mapcount(page)
+						     + !!PageSwapCache(page)))
+						forcecow = 1;
+					unlock_page(page);
+				}
+			}
+		}
 		rss[!!PageAnon(page)]++;
+	}
+
+	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
+		pte = pte_wrprotect(pte);
+		if (forcecow) {
+			/* force atomic copy from parent to child */
+			flush_tlb_page(src_vma, addr);
+			/*
+			 * Don't set the dst_pte here to be
+			 * safer, as fork_pre_cow might return
+			 * -EAGAIN and restart.
+			 */
+			goto out;
+		}
 	}
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+out:
+	return forcecow;
 }
 
+static int fork_pre_cow(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm,
+			struct vm_area_struct *dst_vma,
+			struct vm_area_struct *src_vma,
+			unsigned long address,
+			pte_t **dst_ptep, pte_t **src_ptep,
+			spinlock_t **dst_ptlp, spinlock_t **src_ptlp,
+			pmd_t *dst_pmd, pmd_t *src_pmd);
+
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
 
 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -623,6 +676,9 @@ again:
 	arch_enter_lazy_mmu_mode();
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -637,9 +693,38 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+					dst_vma, src_vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+	if (unlikely(forcecow)) {
+		pte_t *_src_pte = src_pte-1, *_dst_pte = dst_pte-1;
+		/*
+		 * Try to COW the child page, as direct I/O is working
+		 * on the parent page; so we have to mark the parent
+		 * pte read-write again before dropping the PT lock and
+		 * mmap_sem, to avoid the page being cowed in the
+		 * parent and the direct I/O getting lost.
+		 */
+		forcecow = fork_pre_cow(dst_mm, src_mm,
+					dst_vma, src_vma,
+					addr-PAGE_SIZE,
+					&_dst_pte, &_src_pte,
+					&dst_ptl, &src_ptl,
+					dst_pmd, src_pmd);
+		src_pte = _src_pte + 1;
+		dst_pte = _dst_pte + 1;
+		/* after the page copy set the parent pte writeable again */
+		set_pte_at(src_mm, addr-PAGE_SIZE, src_pte-1,
+			   pte_mkwrite(*(src_pte-1)));
+		if (unlikely(forcecow == -EAGAIN)) {
+			dst_pte--;
+			src_pte--;
+			addr -= PAGE_SIZE;
+			rss[1]--;
+		}
+	}
 
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
@@ -647,13 +732,16 @@ again:
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (unlikely(forcecow == -ENOMEM))
+		return -ENOMEM;
 	if (addr != end)
 		goto again;
 	return 0;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -668,14 +756,15 @@ static inline int copy_pmd_range(struct 
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+				   dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pgd_t *dst_pgd, pgd_t *src_pgd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -690,19 +779,20 @@ static inline int copy_pud_range(struct 
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+				   dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		    struct vm_area_struct *dst_vma,
+		    struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
 	int ret;
 
 	/*
@@ -711,20 +801,21 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
-		if (!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+		if (!src_vma->anon_vma)
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm,
+					       dst_vma, src_vma);
 
-	if (unlikely(is_pfn_mapping(vma))) {
+	if (unlikely(is_pfn_mapping(src_vma))) {
 		/*
 		 * We do not free on error cases below as remove_vma
 		 * gets called on error from higher level routine
 		 */
-		ret = track_pfn_vma_copy(vma);
+		ret = track_pfn_vma_copy(src_vma);
 		if (ret)
 			return ret;
 	}
@@ -735,7 +826,7 @@ int copy_page_range(struct mm_struct *ds
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
 	ret = 0;
@@ -746,15 +837,15 @@ int copy_page_range(struct mm_struct *ds
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, src_vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+						  src_vma->vm_start, end);
 	return ret;
 }
 
@@ -1091,7 +1182,7 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
  * Do a quick page-table lookup for a single page.
  */
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+			unsigned int *flagsp)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -1100,6 +1191,7 @@ struct page *follow_page(struct vm_area_
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long flags = *flagsp;
 
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
@@ -1145,8 +1237,19 @@ struct page *follow_page(struct vm_area_
 	if (unlikely(!page))
 		goto bad_page;
 
-	if (flags & FOLL_GET)
+	if (flags & FOLL_GET) {
+		if (PageAnon(page)) {
+			/* de-cow to make direct read from memory safe */
+			if (!pte_write(pte)) {
+				page = NULL;
+				*flagsp |= FOLL_WRITE;
+				goto unlock;
+			}
+			if (!PageGUP(page))
+				SetPageGUP(page);
+		}
 		get_page(page);
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1275,7 +1378,7 @@ int __get_user_pages(struct task_struct 
 
 		if (is_vm_hugetlb_page(vma)) {
 			i = follow_hugetlb_page(mm, vma, pages, vmas,
-						&start, &len, i, write);
+						&start, &len, i);
 			continue;
 		}
 
@@ -1303,7 +1406,7 @@ int __get_user_pages(struct task_struct 
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
-			while (!(page = follow_page(vma, start, foll_flags))) {
+			while (!(page = follow_page(vma, start, &foll_flags))) {
 				int ret;
 				ret = handle_mm_fault(mm, vma, start,
 						foll_flags & FOLL_WRITE);
@@ -1865,6 +1968,81 @@ static inline void cow_user_page(struct 
 		flush_dcache_page(dst);
 	} else
 		copy_user_highpage(dst, src, va, vma);
+}
+
+static int fork_pre_cow(struct mm_struct *dst_mm,
+			struct mm_struct *src_mm,
+			struct vm_area_struct *dst_vma,
+			struct vm_area_struct *src_vma,
+			unsigned long address,
+			pte_t **dst_ptep, pte_t **src_ptep,
+			spinlock_t **dst_ptlp, spinlock_t **src_ptlp,
+			pmd_t *dst_pmd, pmd_t *src_pmd)
+{
+	pte_t _src_pte, _dst_pte;
+	struct page *old_page, *new_page;
+
+	_src_pte = **src_ptep;
+	_dst_pte = **dst_ptep;
+	old_page = vm_normal_page(src_vma, address, **src_ptep);
+	BUG_ON(!old_page);
+	get_page(old_page);
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(*src_ptlp);
+	pte_unmap_nested(*src_ptep);
+	pte_unmap_unlock(*dst_ptep, *dst_ptlp);
+
+	new_page = alloc_page_vma(GFP_HIGHUSER, dst_vma, address);
+	if (unlikely(!new_page)) {
+		*dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address,
+						dst_ptlp);
+		*src_ptep = pte_offset_map_nested(src_pmd, address);
+		*src_ptlp = pte_lockptr(src_mm, src_pmd);
+		spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING);
+		arch_enter_lazy_mmu_mode();
+		return -ENOMEM;
+	}
+	cow_user_page(new_page, old_page, address, dst_vma);
+
+	*dst_ptep = pte_offset_map_lock(dst_mm, dst_pmd, address, dst_ptlp);
+	*src_ptep = pte_offset_map_nested(src_pmd, address);
+	*src_ptlp = pte_lockptr(src_mm, src_pmd);
+	spin_lock_nested(*src_ptlp, SINGLE_DEPTH_NESTING);
+	arch_enter_lazy_mmu_mode();
+
+	/*
+	 * The src pte can be unmapped by the VM from under us after
+	 * dropping src_ptlp, but it can't be cowed from under us, as
+	 * fork holds the mmap_sem in write mode.
+	 */
+	if (!pte_same(**src_ptep, _src_pte))
+		goto eagain;
+	if (!pte_same(**dst_ptep, _dst_pte))
+		goto eagain;
+
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	page_cache_release(old_page);
+
+	__SetPageUptodate(new_page);
+	flush_cache_page(src_vma, address, pte_pfn(**src_ptep));
+	_dst_pte = mk_pte(new_page, dst_vma->vm_page_prot);
+	_dst_pte = maybe_mkwrite(pte_mkdirty(_dst_pte), dst_vma);
+	page_add_new_anon_rmap(new_page, dst_vma, address);
+	set_pte_at(dst_mm, address, *dst_ptep, _dst_pte);
+	update_mmu_cache(dst_vma, address, _dst_pte);
+	return 0;
+
+eagain:
+	page_cache_release(old_page);
+	page_cache_release(new_page);
+	/*
+	 * Later we'll repeat the copy of this pte, so here we have to
+	 * undo the mapcount and page count taken in copy_one_pte.
+	 */
+	page_remove_rmap(old_page);
+	page_cache_release(old_page);
+	return -EAGAIN;
 }
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -64,6 +64,8 @@ static void put_compound_page(struct pag
 	if (put_page_testzero(page)) {
 		compound_page_dtor *dtor;
 
+		if (PageGUP(page))
+			__ClearPageGUP(page);
 		dtor = get_compound_page_dtor(page);
 		(*dtor)(page);
 	}


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 13:56                           ` Andrea Arcangeli
@ 2009-03-16 16:01                             ` Nick Piggin
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 16:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm

On Tuesday 17 March 2009 00:56:54 Andrea Arcangeli wrote:
> On Sat, Mar 14, 2009 at 03:59:11PM +1100, Nick Piggin wrote:
> > It does touch gup-fast, but it just adds one branch and no barrier in the
>
> My question is what trick do you use to stop gup-fast from returning
> the page mapped read-write by the pte, if gup-fast doesn't take any
> lock whatsoever, doesn't set any bit in any page or vma, doesn't
> recheck that the pte is still viable after having set any bit on
> pages or vmas, and you still don't send a flood of IPIs from the fork
> fast path (the no-race case).

If the page is not marked PageDontCOW, then get_user_pages decows it,
which gives synchronisation against fork. If it is marked PageDontCOW,
then it can't possibly be COWed by fork, previous or subsequent.
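
As a condensed sketch of that invariant (the names and control flow are
abbreviated from the patch further down, not new code):

	/* gup side, __get_user_pages(): de-COW before taking the pin,
	 * even for read-only gup, by forcing a write fault */
	decow = (pages && is_cow_mapping(vma->vm_flags));
	if (write || decow)
		foll_flags |= FOLL_WRITE;
	/* ... fault and look the page up ... */
	if (decow && !PageDontCOW(page))
		SetPageDontCOW(page);	/* pinned: never COW-share it again */

	/* fork side, copy_one_pte(): a PageDontCOW page is not shared
	 * read-only with the child; decow_one_pte() gives the child its
	 * own copy immediately, so the parent keeps the pinned page. */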


> > Possibly that's the right way to go. Depends if it is in the slightest
> > performance critical. If not, I would just let do_wp_page do the work
> > to avoid a little bit of logic, but either way is not a big deal to me.
>
> fork is less performance critical than do_wp_page, still in fork
> microbenchmark no slowdown is measured with the patch. Before I
> introduced PG_gup there were false positives triggered by the pagevec
> temporary pins, that was measurable, after PG_gup the fast path is

OK. Mine doesn't get false positives, but it doesn't try to reintroduce
pages as COW candidates after the get_user_pages is finished. This is
how it is simpler than your patch.


> unaffected (I've still to measure the gup-fast slowdown from setting
> PG_gup, but I'm rather optimistic that you're underestimating the cost
> of walking 4 layers of pagetables compared to a locked op on an
> L1-exclusive cacheline, so I think it'll be lost in the noise). I think
> the big win of gup-fast is primarily in not having to search vmas, and
> in turn not having to take any shared lock like mmap_sem/PT lock, so it
> scales at the page level with just the get-page being the troublesome
> cacheline.

You lost the get_head_page_multiple too for huge pages. This is the
path that Oracle/DB2 will always go down when running any benchmarks.
At the current DIO_PAGES size, this means adding up to 63 atomics and
64 mfences, and touching the cachelines of 63-64 non-head struct
pages per request.

OK probably even those databases don't get a chance to do such big IOs,
but they definitely will be doing larger than 4K at a time in many
cases (probably even their internal block size can be larger).
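
(To spell out that arithmetic, assuming fs/direct-io.c still has
#define DIO_PAGES 64: each request pins up to 64 user pages at a time;
when those fall inside one hugepage, get_head_page_multiple() was a
single atomic on the head page, so taking a per-subpage get_page()
instead costs up to 63 extra atomics, and the smp_mb() after SetPageGUP
is one mfence per page on x86.)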


> > One side of the race is direct IO read writing to fork child page.
> > The other side of the race is fork child page write leaking into
> > the direct IO.
> >
> > My patch solves both sides by de-cowing *any* COW page before it
> > may be returned from get_user_pages (for read or write).
>
> I see what you mean now. If you read the comment in my patch you'll
> see I explicitly intended that only people writing into memory with
> gup were troublesome here. As you point out, using gup for _reading_
> from memory is troublesome as well if the child writes to those
> pages. This is a lesser problem, because the major issue is that fork
> alone is enough to generate memory corruption even if the child isn't
> touching those pages. The reverse race requires the child to write to
> those pages, so I guess it never triggered in real-life apps. But
> nevertheless I totally agree: if we fix write-to-memory-with-gup we
> have to fix read-from-memory-with-gup.

Yes.


> Below I updated my patch and the corresponding commit header to fix the
> reverse race too. However, I had to enlarge the buffer to 40M to
> reproduce with your testcase, because my HD was too fast otherwise.

You're using a solid state disk? :)
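
For reference, a reproducer in the spirit of that testcase might look
roughly like the sketch below (hypothetical, not the actual testcase;
the race is timing-dependent, so the buffer size and the sleep are just
tuning knobs):

#define _GNU_SOURCE		/* O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (40 * 1024 * 1024)	/* 40M, as mentioned above */

static char *buf;
static int fd;

static void *dio_read(void *arg)
{
	(void)arg;
	/* O_DIRECT: the DMA targets buf's pinned pages directly */
	pread(fd, buf, BUF_SIZE, 0);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t th;
	size_t off;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign((void **)&buf, 4096, BUF_SIZE))
		return 1;
	memset(buf, 0, BUF_SIZE);	/* fault in the anon pages */

	pthread_create(&th, NULL, dio_read, NULL);
	if (fork() == 0) {
		sleep(1);		/* keep the pages shared a while */
		_exit(0);
	}
	/* parent writes while the read is in flight: each COW fault
	 * copies a page, and the DMA completes into the page the parent
	 * no longer maps, so the read data is lost */
	for (off = 0; off < BUF_SIZE; off += 4096)
		buf[off] = 1;
	pthread_join(th, NULL);
	return 0;
}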


> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -89,6 +89,26 @@ static noinline int gup_pte_range(pmd_t
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
>  		get_page(page);
> +		if (PageAnon(page)) {
> +			if (!PageGUP(page))
> +				SetPageGUP(page);
> +			smp_mb();
> +			/*
> +			 * Fork doesn't want to flush the smp-tlb for
> +			 * every pte that it marks readonly, but newly
> +			 * created shared anon pages cannot have
> +			 * direct-io going to them, so check whether fork
> +			 * made the page shared before we took the
> +			 * page pin.
> +			 * De-cow to make direct reads from memory safe.
> +			 */
> +			if ((pte_flags(gup_get_pte(ptep)) &
> +			     (mask | _PAGE_SPECIAL)) != (mask|_PAGE_RW)) {
> +				put_page(page);
> +				pte_unmap(ptep);
> +				return 0;

Hmm, so this is disabling fast-gup for RO anonymous ranges?

I guess this seems like it covers the reverse race then... btw powerpc
has a slightly different fast-gup scheme where it isn't actually holding
off TLB shootdown. I don't think you need to do anything too different,
but better double check.

And here is my improved patch. Same logic but just streamlines the
decow stuff a bit and cuts out some unneeded stuff. This should be
pretty complete for 4K pages. Except I'm a little unsure about the
"ptes don't match, retry" path of the decow procedure. Lots of tricky
little details to get right... And I'm not quite sure that you got
this right either -- vmscan.c can turn the child pte into a swap pte
here, right? In which case I think you need to drop its swapcache
entry, don't you? I don't know if there are other ways it could be
changed, but I import the full zap_pte function over just in case.

--
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/include/linux/mm.h	2009-03-17 00:37:59.000000000 +1100
@@ -789,7 +789,7 @@ int walk_page_range(unsigned long addr, 
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma);
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/mm/memory.c	2009-03-17 02:43:21.000000000 +1100
@@ -533,12 +533,171 @@ out:
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, we do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+static void cow_user_page(struct page *dst, struct page *src,
+			unsigned long va, struct vm_area_struct *vma)
+{
+	/*
+	 * If the source page was a PFN mapping, we don't have
+	 * a "struct page" for it. We do a best-effort copy by
+	 * just copying from the original user address. If that
+	 * fails, we just zero-fill it. Live with it.
+	 */
+	if (unlikely(!src)) {
+		void *kaddr = kmap_atomic(dst, KM_USER0);
+		void __user *uaddr = (void __user *)(va & PAGE_MASK);
+
+		/*
+		 * This really shouldn't fail, because the page is there
+		 * in the page tables. But it might just be unreadable,
+		 * in which case we just give up and fill the result with
+		 * zeroes.
+		 */
+		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
+			memset(kaddr, 0, PAGE_SIZE);
+		kunmap_atomic(kaddr, KM_USER0);
+		flush_dcache_page(dst);
+	} else
+		copy_user_highpage(dst, src, va, vma);
+}
+
+void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, pte_t *ptep)
+{
+	pte_t pte = *ptep;
+
+	if (pte_present(pte)) {
+		struct page *page;
+
+		flush_cache_page(vma, addr, pte_pfn(pte));
+		pte = ptep_clear_flush(vma, addr, ptep);
+		page = vm_normal_page(vma, addr, pte);
+		if (page) {
+			if (pte_dirty(pte))
+				set_page_dirty(page);
+			page_remove_rmap(page);
+			page_cache_release(page);
+			update_hiwater_rss(mm);
+			if (PageAnon(page))
+				dec_mm_counter(mm, anon_rss);
+			else
+				dec_mm_counter(mm, file_rss);
+		}
+	} else {
+		if (!pte_file(pte))
+			free_swap_and_cache(pte_to_swp_entry(pte));
+		pte_clear_not_present_full(mm, addr, ptep, 0);
+	}
+}
+/*
+ * breaks COW of child pte that has been marked COW by fork().
+ * Must be called with the child's ptl held and pte mapped.
+ * Returns 0 on success with ptl held and pte mapped.
+ * -ENOMEM on OOM failure, or -EAGAIN if something changed under us.
+ * ptl dropped and pte unmapped on error cases.
+ */
+static noinline int decow_one_pte(struct mm_struct *mm, pte_t *ptep, pmd_t *pmd,
+			spinlock_t *ptl, struct vm_area_struct *vma,
+			unsigned long address)
+{
+	pte_t pte = *ptep;
+	struct page *page, *new_page;
+	int ret;
+
+	BUG_ON(!pte_present(pte));
+	BUG_ON(pte_write(pte));
+
+	page = vm_normal_page(vma, address, pte);
+	BUG_ON(!page);
+	BUG_ON(!PageAnon(page));
+	BUG_ON(!PageDontCOW(page));
+
+	/* The following code comes from do_wp_page */
+	page_cache_get(page);
+	pte_unmap_unlock(pte, ptl);
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
+	VM_BUG_ON(page == ZERO_PAGE(0));
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	if (!new_page)
+		goto oom;
+	/*
+	 * Don't let another task, with possibly unlocked vma,
+	 * keep the mlocked page.
+	 */
+	if (vma->vm_flags & VM_LOCKED) {
+		lock_page(page);	/* for LRU manipulation */
+		clear_page_mlock(page);
+		unlock_page(page);
+	}
+	cow_user_page(new_page, page, address, vma);
+	__SetPageUptodate(new_page);
+
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
+		goto oom_free_new;
+
+	/*
+	 * Re-check the pte - we dropped the lock
+	 */
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (pte_same(*ptep, pte)) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(pte));
+		entry = mk_pte(new_page, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		/*
+		 * Clear the pte entry and flush it first, before updating the
+		 * pte with the new entry. This will avoid a race condition
+		 * seen in the presence of one thread doing SMC and another
+		 * thread doing COW.
+		 */
+		ptep_clear_flush_notify(vma, address, ptep);
+		page_add_new_anon_rmap(new_page, vma, address);
+		set_pte_at(mm, address, ptep, entry);
+
+		/* See comment in do_wp_page */
+		page_remove_rmap(page);
+		page_cache_release(page);
+		ret = 0;
+	} else {
+		if (!pte_none(*ptep))
+			zap_pte(mm, vma, address, ptep);
+		pte_unmap_unlock(pte, ptl);
+		mem_cgroup_uncharge_page(new_page);
+		page_cache_release(new_page);
+		ret = -EAGAIN;
+	}
+	page_cache_release(page);
+
+	return ret;
+
+oom_free_new:
+	page_cache_release(new_page);
+oom:
+	page_cache_release(page);
+	return -ENOMEM;
+}
+
+/*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -546,6 +705,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int ret = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -597,20 +757,26 @@ copy_one_pte(struct mm_struct *dst_mm, s
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
 		rss[!!PageAnon(page)]++;
+		if (unlikely(PageDontCOW(page)))
+			ret = 1;
 	}
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return ret;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		pmd_t *dst_pmd, pmd_t *src_pmd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int decow;
 
 again:
 	rss[1] = rss[0] = 0;
@@ -637,7 +803,10 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		decow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
+						src_vma, addr, rss);
+		if (unlikely(decow))
+			goto decow;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -646,14 +815,31 @@ again:
 	pte_unmap_nested(src_pte - 1);
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
+next:
 	cond_resched();
 	if (addr != end)
 		goto again;
 	return 0;
+
+decow:
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(src_ptl);
+	pte_unmap_nested(src_pte);
+	add_mm_rss(dst_mm, rss[0], rss[1]);
+	decow = decow_one_pte(dst_mm, dst_pte, dst_pmd, dst_ptl, dst_vma, addr);
+	if (decow == -ENOMEM)
+		return -ENOMEM;
+	if (decow == -EAGAIN)
+		goto again;
+	pte_unmap_unlock(dst_pte, dst_ptl);
+	cond_resched();
+	addr += PAGE_SIZE;
+	goto next;
 }
 
 static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
+		pud_t *dst_pud, pud_t *src_pud,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pmd_t *src_pmd, *dst_pmd;
@@ -668,14 +854,15 @@ static inline int copy_pmd_range(struct 
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
+		pgd_t *dst_pgd, pgd_t *src_pgd,
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		unsigned long addr, unsigned long end)
 {
 	pud_t *src_pud, *dst_pud;
@@ -690,19 +877,19 @@ static inline int copy_pud_range(struct 
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
 		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-						vma, addr, next))
+						dst_vma, src_vma, addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
 int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
 	int ret;
 
 	/*
@@ -711,20 +898,20 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
-		if (!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
+		if (!src_vma->anon_vma)
 			return 0;
 	}
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
 
-	if (unlikely(is_pfn_mapping(vma))) {
+	if (unlikely(is_pfn_mapping(src_vma))) {
 		/*
 		 * We do not free on error cases below as remove_vma
 		 * gets called on error from higher level routine
 		 */
-		ret = track_pfn_vma_copy(vma);
+		ret = track_pfn_vma_copy(src_vma);
 		if (ret)
 			return ret;
 	}
@@ -735,7 +922,7 @@ int copy_page_range(struct mm_struct *ds
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
 	ret = 0;
@@ -746,15 +933,16 @@ int copy_page_range(struct mm_struct *ds
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
 		if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, addr, next))) {
+					    dst_vma, src_vma, addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow_mapping(vma->vm_flags))
+	if (is_cow_mapping(src_vma->vm_flags))
 		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+						  src_vma->vm_start, end);
+
 	return ret;
 }
 
@@ -1199,8 +1387,6 @@ static inline int use_zero_page(struct v
 	return !vma->vm_ops || !vma->vm_ops->fault;
 }
 
-
-
 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		     unsigned long start, int len, int flags,
 		struct page **pages, struct vm_area_struct **vmas)
@@ -1225,6 +1411,7 @@ int __get_user_pages(struct task_struct 
 	do {
 		struct vm_area_struct *vma;
 		unsigned int foll_flags;
+		int decow;
 
 		vma = find_extend_vma(mm, start);
 		if (!vma && in_gate_area(tsk, start)) {
@@ -1279,6 +1466,14 @@ int __get_user_pages(struct task_struct 
 			continue;
 		}
 
+		/*
+		 * Except in special cases where the caller will not read from
+		 * or write to these pages, we must break COW for any pages
+		 * returned from get_user_pages, so that our caller does not
+		 * subsequently end up with the pages of a parent or child
+		 * process after a COW takes place.
+		 */
+		decow = (pages && is_cow_mapping(vma->vm_flags));
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
@@ -1299,7 +1494,7 @@ int __get_user_pages(struct task_struct 
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
@@ -1342,6 +1537,8 @@ int __get_user_pages(struct task_struct 
 			if (pages) {
 				pages[i] = page;
 
+				if (decow && !PageDontCOW(page))
+					SetPageDontCOW(page);
 				flush_anon_page(vma, page, start);
 				flush_dcache_page(page);
 			}
@@ -1370,7 +1567,6 @@ int get_user_pages(struct task_struct *t
 				start, len, flags,
 				pages, vmas);
 }
-
 EXPORT_SYMBOL(get_user_pages);
 
 pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
@@ -1829,45 +2025,6 @@ static inline int pte_unmap_same(struct 
 }
 
 /*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
-{
-	/*
-	 * If the source page was a PFN mapping, we don't have
-	 * a "struct page" for it. We do a best-effort copy by
-	 * just copying from the original user address. If that
-	 * fails, we just zero-fill it. Live with it.
-	 */
-	if (unlikely(!src)) {
-		void *kaddr = kmap_atomic(dst, KM_USER0);
-		void __user *uaddr = (void __user *)(va & PAGE_MASK);
-
-		/*
-		 * This really shouldn't fail, because the page is there
-		 * in the page tables. But it might just be unreadable,
-		 * in which case we just give up and fill the result with
-		 * zeroes.
-		 */
-		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
-			memset(kaddr, 0, PAGE_SIZE);
-		kunmap_atomic(kaddr, KM_USER0);
-		flush_dcache_page(dst);
-	} else
-		copy_user_highpage(dst, src, va, vma);
-}
-
-/*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
  * and decrementing the shared-page counter for the old page.
@@ -1930,6 +2087,8 @@ static int do_wp_page(struct mm_struct *
 		}
 		reuse = reuse_swap_page(old_page);
 		unlock_page(old_page);
+		VM_BUG_ON(PageDontCOW(old_page) && !reuse);
+
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
 		/*
@@ -2936,7 +3095,8 @@ int make_pages_present(unsigned long add
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
 	ret = get_user_pages(current, current->mm, addr,
-			len, write, 0, NULL, NULL);
+			len, write, 0,
+			NULL, NULL);
 	if (ret < 0)
 		return ret;
 	return ret == len ? 0 : -EFAULT;
@@ -3086,7 +3246,7 @@ int access_process_vm(struct task_struct
 		struct page *page = NULL;
 
 		ret = get_user_pages(tsk, mm, addr, 1,
-				write, 1, &page, &vma);
+				0, 1, &page, &vma);
 		if (ret <= 0) {
 			/*
 			 * Check if this is a VM_IO | VM_PFNMAP VMA, which
Index: linux-2.6/arch/x86/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/gup.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/arch/x86/mm/gup.c	2009-03-14 16:21:40.000000000 +1100
@@ -83,11 +83,14 @@ static noinline int gup_pte_range(pmd_t 
 		struct page *page;
 
 		if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) {
+failed:
 			pte_unmap(ptep);
 			return 0;
 		}
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
+			goto failed;
 		get_page(page);
 		pages[*nr] = page;
 		(*nr)++;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2009-03-14 02:48:13.000000000 +1100
@@ -94,6 +94,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
+	PG_dontcow,		/* PageAnon page in a VM_DONTCOW vma */
 #ifdef CONFIG_UNEVICTABLE_LRU
 	PG_unevictable,		/* Page is "unevictable"  */
 	PG_mlocked,		/* Page is vma mlocked */
@@ -208,6 +209,8 @@ __PAGEFLAG(SlubDebug, slub_debug)
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
 __PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(DontCOW, dontcow)
+SETPAGEFLAG(DontCOW, dontcow)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-03-14 02:48:06.000000000 +1100
+++ linux-2.6/kernel/fork.c	2009-03-14 15:12:09.000000000 +1100
@@ -353,7 +353,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
Index: linux-2.6/mm/internal.h
===================================================================
--- linux-2.6.orig/mm/internal.h	2009-03-13 20:25:00.000000000 +1100
+++ linux-2.6/mm/internal.h	2009-03-17 02:41:48.000000000 +1100
@@ -15,6 +15,8 @@
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
+void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, pte_t *ptep);
 
 extern void prep_compound_page(struct page *page, unsigned long order);
 extern void prep_compound_gigantic_page(struct page *page, unsigned long order);
Index: linux-2.6/arch/powerpc/mm/gup.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/gup.c	2009-03-17 01:00:48.000000000 +1100
+++ linux-2.6/arch/powerpc/mm/gup.c	2009-03-17 01:02:10.000000000 +1100
@@ -39,6 +39,8 @@ static noinline int gup_pte_range(pmd_t 
 			return 0;
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		if (PageAnon(page) && unlikely(!PageDontCOW(page)))
+			return 0;
 		if (!page_cache_get_speculative(page))
 			return 0;
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c	2009-03-17 02:37:21.000000000 +1100
+++ linux-2.6/mm/fremap.c	2009-03-17 02:42:11.000000000 +1100
@@ -23,32 +23,6 @@
 
 #include "internal.h"
 
-static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long addr, pte_t *ptep)
-{
-	pte_t pte = *ptep;
-
-	if (pte_present(pte)) {
-		struct page *page;
-
-		flush_cache_page(vma, addr, pte_pfn(pte));
-		pte = ptep_clear_flush(vma, addr, ptep);
-		page = vm_normal_page(vma, addr, pte);
-		if (page) {
-			if (pte_dirty(pte))
-				set_page_dirty(page);
-			page_remove_rmap(page);
-			page_cache_release(page);
-			update_hiwater_rss(mm);
-			dec_mm_counter(mm, file_rss);
-		}
-	} else {
-		if (!pte_file(pte))
-			free_swap_and_cache(pte_to_swp_entry(pte));
-		pte_clear_not_present_full(mm, addr, ptep, 0);
-	}
-}
-
 /*
  * Install a file pte to a given virtual memory address, release any
  * previously existing mapping.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-14  5:20                   ` Nick Piggin
@ 2009-03-16 16:01                     ` KOSAKI Motohiro
  2009-03-16 16:23                       ` Nick Piggin
  2009-03-17  0:44                       ` Linus Torvalds
  0 siblings, 2 replies; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-16 16:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Benjamin Herrenschmidt, Linus Torvalds,
	Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm

Hi

> > IB folks so far have been avoiding the fork() trap thanks to
> > madvise(MADV_DONTFORK) afaik. And it all goes generally well when the
> > whole application knows what it's doing and just plain avoids fork.
> >
> > -But- things get nasty if for some reason, the user of gup is somewhere
> > deep in some kind of library that an application uses without knowing,
> > while forking here or there to run shell scripts or other helpers.
> >
> > I've seen it :-)
> >
> > So if a solution can be found that doesn't uglify the whole thing beyond
> > recognition, it's probably worth it.
> 
> AFAIKS, the approach I've posted is probably the simplest (and maybe only
> way) to really fix it. It's not too ugly.

May I join this discussion?

If O_DIRECT is our only concern, the patch below is enough.

Yes, my patch isn't really a solution.
Andrea already pointed out that it's not an O_DIRECT issue, it's a gup vs fork issue.
*And* my patch is crazy slow :)

So, my point is, I merely oppose an easy decision to give up on fixing this.

Currently, I agree we don't have an easy way to fix it,
but I believe we can solve this problem completely in the near future,
because LKML folks are very cool guys.

Thus, I don't want to have to extend the "BUGS" section of the O_DIRECT man page.
Nor do I want to have to say "Oh, Solaris can meet your requirement,
AIX can, FreeBSD can, but Linux can't".
That would wound my pride as a Linux developer a bit ;)

Andrea's patch seems a bit more complex than yours, but I think it can
be improved later.
A man page change, on the other hand, can't be undone.


In addition, may I raise my gup-fast concern?
AFAIK, the worth of gup-fast is not in removing one atomic operation;
not grabbing mmap_sem is what's essential.

That's because:
  - the block layer and I/O drivers also take several locks,
    so direct I/O performs many atomic operations anyway;
    one more atomic operation is not so expensive.
  - but mmap_sem is one of the most easily contended locks in Linux,
    because
    - almost all modern DB software is multithreaded.
    - glibc malloc/free can issue mmap, munmap and mprotect syscalls,
      and those syscalls take down_write(&mmap_sem).
    - page faults also take down_read(&mmap_sem).
    - and userland applications can't avoid malloc() and page faults.

However, I haven't seen anyone try to munmap() a direct-I/O region,
which implies mmap_sem could be split up in a fine-grained way.
(Or can we remove it completely? IIRC PeterZ tried that about two
months ago.)

After that, we could grab mmap_sem without performance degradation,
and much of the mmap_sem-avoidance effort could be removed.

Perhaps I'm talking nonsense; gup-fast was introduced to solve a DB2
problem, but I don't have any DB2 development experience.
Am I over-optimistic?
Am I over-optimistic?



> You can't easily fix it at write-time by COWing in the right direction like
> Linus suggested because at that point you may have multiple get_user_pages
> (for read) from the parent and child on the page, so there is no way to COW
> it in the right direction.
> 
> You could do something crazy like allowing only one get_user_pages read on a
> wp page, and recording which direction to send it if it does get COWed. But
> at that point you've got something that's far uglier in the core code and
> more complex than what I posted.





---
 fs/direct-io.c            |    2 ++
 include/linux/init_task.h |    1 +
 include/linux/mm_types.h  |    3 +++
 kernel/fork.c             |    3 +++
 4 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..8f9a810 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
 		(end > i_size_read(inode)));

+	down_read(&current->mm->directio_sem);
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
+	up_read(&current->mm->directio_sem);

 	/*
 	 * In case of error extending write may have instantiated a few
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..68e02b9 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
+	.directio_sem	= __RWSEM_INITIALIZER(name.directio_sem), \
 }

 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..39ba4e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/* if there is in-flight direct I/O, we can't fork. */
+	struct rw_semaphore directio_sem;
 };

 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..bbe9fa7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;

+	down_write(&oldmm->directio_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->directio_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->directio_sem);

 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
-- 
1.6.0.6




^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:01                     ` KOSAKI Motohiro
@ 2009-03-16 16:23                       ` Nick Piggin
  2009-03-16 16:32                         ` Linus Torvalds
  2009-03-18  2:04                         ` KOSAKI Motohiro
  2009-03-17  0:44                       ` Linus Torvalds
  1 sibling, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 16:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Benjamin Herrenschmidt, Linus Torvalds, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 03:01:42 KOSAKI Motohiro wrote:
> Hi
>

> > AFAIKS, the approach I've posted is probably the simplest (and maybe only
> > way) to really fix it. It's not too ugly.
>
> May I join this discussion?

Of course :)


> If O_DIRECT is our only concern, the patch below is enough.
>
> Yes, my patch isn't really a solution.
> Andrea already pointed out that it's not an O_DIRECT issue, it's a gup
> vs fork issue. *And* my patch is crazy slow :)

Well, it's an interesting question. I'd say it probably is more than
just O_DIRECT. vmsplice too, for example (which I think is much harder
to fix this way because the pages are retired by the other end of
the pipe, so I don't think you can hold a lock across it).

For other device drivers, one could argue that they are "special" and
require special knowledge and apps to use MADV_DONTFORK... Ben didn't
like that so much, and also some other users of get_user_pages might
come up.

But your patch is interesting. I don't think it is crazy slow... well
it might be a bit slow in the case that a threaded app doing a lot of
direct IO or an app doing async IO forks. But how common is that?

I would be slightly more worried about the common cacheline touched
to take the read lock for multithreaded direct IO, but I'm not sure
how much that will hurt DB2.


> So, my point is, I merely oppose an easy decision to give up on fixing
> this.
>
> Currently, I agree we don't have an easy way to fix it,
> but I believe we can solve this problem completely in the near future,
> because LKML folks are very cool guys.
>
> Thus, I don't want to have to extend the "BUGS" section of the O_DIRECT
> man page. Nor do I want to have to say "Oh, Solaris can meet your
> requirement, AIX can, FreeBSD can, but Linux can't".
> That would wound my pride as a Linux developer a bit ;)
>
> Andrea's patch seems a bit more complex than yours, but I think it can
> be improved later.
> A man page change, on the other hand, can't be undone.
>
>
> In addition, may I raise my gup-fast concern?
> AFAIK, the worth of gup-fast is not in removing one atomic operation;
> not grabbing mmap_sem is what's essential.

Yes, mmap_sem is the big thing. But straight-line speed is important
too.

[...]

> ---
>  fs/direct-io.c            |    2 ++
>  include/linux/init_task.h |    1 +
>  include/linux/mm_types.h  |    3 +++
>  kernel/fork.c             |    3 +++
>  4 files changed, 9 insertions(+), 0 deletions(-)

It is an interesting patch. Thanks for throwing it into the discussion.
I do prefer to close the race up for all cases if we decide to do
anything at all about it, ie. all or nothing. But maybe others disagree.


> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b6d4390..8f9a810 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1206,8 +1206,10 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) &&
>  		(end > i_size_read(inode)));
>
> +	down_read(&current->mm->directio_sem);
>  	retval = direct_io_worker(rw, iocb, inode, iov, offset,
>  				nr_segs, blkbits, get_block, end_io, dio);
> +	up_read(&current->mm->directio_sem);
>
>  	/*
>  	 * In case of error extending write may have instantiated a few
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index e752d97..68e02b9 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
>  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
>  	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
>  	.cpu_vm_mask	= CPU_MASK_ALL,				\
> +	.directio_sem	= __RWSEM_INITIALIZER(name.directio_sem), \
>  }
>
>  #define INIT_SIGNALS(sig) {						\
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d84feb7..39ba4e6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -274,6 +274,9 @@ struct mm_struct {
>  #ifdef CONFIG_MMU_NOTIFIER
>  	struct mmu_notifier_mm *mmu_notifier_mm;
>  #endif
> +
> +	/* if there is in-flight direct I/O, we can't fork. */
> +	struct rw_semaphore directio_sem;
>  };
>
>  /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4854c2c..bbe9fa7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  	unsigned long charge;
>  	struct mempolicy *pol;
>
> +	down_write(&oldmm->directio_sem);
>  	down_write(&oldmm->mmap_sem);
>  	flush_cache_dup_mm(oldmm);
>  	/*
> @@ -368,6 +369,7 @@ out:
>  	up_write(&mm->mmap_sem);
>  	flush_tlb_mm(oldmm);
>  	up_write(&oldmm->mmap_sem);
> +	up_write(&oldmm->directio_sem);
>  	return retval;
>  fail_nomem_policy:
>  	kmem_cache_free(vm_area_cachep, tmp);
> @@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
>  	mm->free_area_cache = TASK_UNMAPPED_BASE;
>  	mm->cached_hole_size = ~0UL;
>  	mm_init_owner(mm, p);
> +	init_rwsem(&mm->directio_sem);
>
>  	if (likely(!mm_alloc_pgd(mm))) {
>  		mm->def_flags = 0;



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:23                       ` Nick Piggin
@ 2009-03-16 16:32                         ` Linus Torvalds
  2009-03-16 16:50                           ` Nick Piggin
  2009-03-18  2:04                         ` KOSAKI Motohiro
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 16:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Yes, my patch isn't really a solution.
> > Andrea already pointed out that it's not an O_DIRECT issue, it's a gup
> > vs fork issue. *And* my patch is crazy slow :)
> 
> Well, it's an interesting question. I'd say it probably is more than
> just O_DIRECT. vmsplice too, for example (which I think is much harder
> to fix this way because the pages are retired by the other end of
> the pipe, so I don't think you can hold a lock across it).

Well, only the "fork()" has the race problem.

So having a fork-specific lock (but not naming it by directio) actually 
does make sense. The fork is much less performance-critical than most 
random mmap_sem users - and doesn't have the same scalability issues 
either (ie people probably _do_ want to do mmap/munmap/brk concurrently 
with gup lookup, but there's much less worry about concurrent fork() 
performance).

It doesn't necessarily make the general problem go away, but it makes the 
_particular_ race between get_user_pages() and fork() go away. Then you 
can do per-page flags or whatever and not have to worry about concurrent 
lookups.

			Linus


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:32                         ` Linus Torvalds
@ 2009-03-16 16:50                           ` Nick Piggin
  2009-03-16 17:02                             ` Linus Torvalds
  2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > Yes, my patch isn't really a solution.
> > > Andrea already pointed out that it's not an O_DIRECT issue, it's a
> > > gup vs fork issue. *And* my patch is crazy slow :)
> >
> > Well, it's an interesting question. I'd say it probably is more than
> > just O_DIRECT. vmsplice too, for example (which I think is much harder
> > to fix this way because the pages are retired by the other end of
> > the pipe, so I don't think you can hold a lock across it).
>
> Well, only the "fork()" has the race problem.
>
> So having a fork-specific lock (but not naming it by directio) actually
> does make sense. The fork is much less performance-critical than most
> random mmap_sem users - and doesn't have the same scalability issues
> either (ie people probably _do_ want to do mmap/munmap/brk concurrently
> with gup lookup, but there's much less worry about concurrent fork()
> performance).
>
> It doesn't necessarily make the general problem go away, but it makes the
> _particular_ race between get_user_pages() and fork() go away. Then you
> can do per-page flags or whatever and not have to worry about concurrent
> lookups.

Hmm, I see what you mean there; it can be used to solve Andrea's race
instead of using set_bit/memory barriers. But I think then you would
still need to put this lock in fork and get_user_pages[_fast], *and*
still do most of the other stuff required in Andrea's patch.

So I'm not sure that was the intent of KOSAKI-san's patch.

It actually should solve one side of the race completely, as is, but
only for direct-IO, because it ensures that no get_user_pages for direct
IO can be outstanding over a fork. However, it a) does not solve other
get_user_pages problems, and b) doesn't solve the case where a readonly
get_user_pages on an already-shared pte gets confused if that pte is
subsequently COWed -- the I/O can end up polluted with wrong data.
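
Spelled out, one way (b) plays out (an illustrative timeline; it assumes
the semaphore is only held across the I/O itself):

 1. fork() completes normally; parent and child share anon page P, both
    ptes read-only.
 2. The parent then starts an O_DIRECT file *write* from that buffer:
    get_user_pages(read) pins P, and read access does not break COW.
 3. The parent stores into the buffer before the DMA has read it:
    do_wp_page() cannot reuse P (the child still maps it), so the parent
    gets a fresh copy and P stays with the child and with the pin.
 4. The child, now the sole mapper of P, writes to it in place (the pin
    is not accounted in the COW reuse check), and the DMA transfers the
    child's data instead of the parent's.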


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:50                           ` Nick Piggin
@ 2009-03-16 17:02                             ` Linus Torvalds
  2009-03-16 17:19                               ` Nick Piggin
  2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 17:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> Hmm, I see what you mean there; it can be used to solve Andrea's race
> instead of using set_bit/memory barriers. But I think then you would
> still need to put this lock in fork and get_user_pages[_fast], *and*
> still do most of the other stuff required in Andrea's patch.

Well, yes and no. 

What if we just had the caller get the lock? And then leave it entirely to
the caller to decide how it wants to synchronize with fork?

In particular, we really _could_ just say "hold the lock for reading for 
as long as you hold the reference count to the page" - since now the lock 
only matters for fork(), nothing else.

And make the forking part use "down_write_killable()", so that you can 
kill the process if it does something bad.

Now you can make vmsplice literally get a read-lock for the whole IO 
operation. The process that does "vmsplice()" will not be able to fork 
until the IO is done, but let's be honest here: if you're doing 
vmsplice(), that is damn well what you WANT!

splice() already has a callback for releasing the pages, so it's doable.

O_DIRECT has similar issues - by the time we return from an O_DIRECT 
write, the pages had better already be written out, so we could just take 
the read-lock over the whole operation.

So don't take the lock in the low level get_user_pages(). Take it as high 
as you want to.

And if some user doesn't want that serialization (maybe ptrace?), don't 
take the lock at all, or take it just over the get_user_pages() call.
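
As a condensed sketch of that scheme (the fork_lock name is hypothetical,
and down_write_killable() is the proposed primitive, not an existing one):

	/* caller side (O_DIRECT, vmsplice, ...): hold the lock from pin
	 * to release, so fork cannot COW-share the pages meanwhile */
	down_read(&current->mm->fork_lock);
	npages = get_user_pages(current, current->mm, start, len,
				write, 0, pages, NULL);
	/* ... do the I/O against pages[] ... */
	for (i = 0; i < npages; i++)
		page_cache_release(pages[i]);
	up_read(&current->mm->fork_lock);	/* only now may this mm fork */

	/* dup_mmap() side: exclude outstanding pins while ptes are copied */
	if (down_write_killable(&oldmm->fork_lock))
		return -EINTR;
	retval = copy_page_range(mm, oldmm, tmp, mpnt);
	up_write(&oldmm->fork_lock);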

			Linus


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 17:02                             ` Linus Torvalds
@ 2009-03-16 17:19                               ` Nick Piggin
  2009-03-16 17:42                                 ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 17:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 04:02:02 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Hmm, I see what you mean there; it can be used to solve Andrea's race
> > instead of using set_bit/memory barriers. But I think then you would
> > still need to put this lock in fork and get_user_pages[_fast], *and*
> > still do most of the other stuff required in Andrea's patch.
>
> Well, yes and no.
>
> What if we just let the caller get the lock? And then leave it entirely to
> the caller to decide how it wants to synchronize with fork?
>
> In particular, we really _could_ just say "hold the lock for reading for
> as long as you hold the reference count to the page" - since now the lock
> only matters for fork(), nothing else.

Well that in theory should close the race in one direction (writing into
the wrong page).

I don't think it closes it in the other direction (reading the wrong data
from the page).

I'm also not quite convinced of vmsplice.


> And make the forking part use "down_write_killable()", so that you can
> kill the process if it does something bad.
>
> Now you can make vmsplice literally get a read-lock for the whole IO
> operation. The process that does "vmsplice()" will not be able to fork
> until the IO is done, but let's be honest here: if you're doing
> vmsplice(), that is damn well what you WANT!

Really? I'm not sure (probably primarily because I've never really seen
how vmsplice would be used).

splice is supposed to be asynchronous, so I don't know why you would
necessarily want to block fork after a splice until the asynchronous
reader on the other end -- one you don't necessarily control or know
anything about -- has read all the data you've sent it.


> splice() already has a callback for releasing the pages, so it's doable.

doable, maybe.


> O_DIRECT has similar issues - by the time we return from an O_DIRECT
> write, the pages had better already be written out, so we could just take
> the read-lock over the whole operation.

Yes I think that's what the patch was doing.


> So don't take the lock in the low level get_user_pages(). Take it as high
> as you want to.
>
> And if some user doesn't want that serialization (maybe ptrace?), don't
> take the lock at all, or take it just over the get_user_pages() call.

BTW. have you looked at my approach yet? I've tried to solve the fork
vs gup race in yet another way. Don't know if you think it is palatable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 17:19                               ` Nick Piggin
@ 2009-03-16 17:42                                 ` Linus Torvalds
  2009-03-16 18:02                                   ` Nick Piggin
  2009-03-16 18:28                                   ` Andrea Arcangeli
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 17:42 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> Well that in theory should close the race in one direction (writing into
> the wrong page).
> 
> I don't think it closes it in the other direction (reading the wrong data
> from the page).

Why?

If somebody does a COW while we have a get_user_pages() page frame cached, 
the get_user_pages() will have increased the page count, so regardless of 
_who_ writes to the page, the writer will always get a new page. No?

So reading data from the page will always get the old pre-cow data. 

[ goes to reading code ]

Oh, damn. That's how it used to work a long time ago when we looked at the 
page count. Now we just look at the page *map* count, we don't look at any 
other counts. So the COW logic won't see that somebody else has a copy.

Maybe we could go back to also looking at page counts?
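
(As a hedged sketch of what such a check could look like in the
do_wp_page() reuse decision -- illustrative only, with a made-up helper
name, not the fix that was eventually merged:)

	/*
	 * Can this anonymous page be reused in place on a write fault?
	 * page_mapcount() only counts ptes, so it misses references
	 * taken by get_user_pages(); page_count() sees those too.
	 * Assumes the caller has already dropped any swap-cache
	 * reference (as reuse_swap_page() does on success).
	 */
	static int can_reuse_anon_page(struct page *page)
	{
		if (page_mapcount(page) != 1)
			return 0;	/* mapped by someone else too */
		/* an elevated count means gup still holds the page */
		return page_count(page) == 1;
	}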

> BTW. have you looked at my approach yet? I've tried to solve the fork
> vs gup race in yet another way. Don't know if you think it is palatable.

I really think we should be able to fix this without _anything_ like that 
at all. Just the lock (and some reuse_swap_page() logic changes).

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 17:42                                 ` Linus Torvalds
@ 2009-03-16 18:02                                   ` Nick Piggin
  2009-03-16 18:05                                     ` Nick Piggin
  2009-03-16 18:14                                     ` Linus Torvalds
  2009-03-16 18:28                                   ` Andrea Arcangeli
  1 sibling, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 18:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > Well that in theory should close the race in one direction (writing into
> > the wrong page).
> >
> > I don't think it closes it in the other direction (reading the wrong data
> > from the page).
>
> Why?
>
> If somebody does a COW while we have a get_user_pages() page frame cached,
> the get_user_pages() will have increased the page count, so regardless of
> _who_ writes to the page, the writer will always get a new page. No?

[(no)]


> Maybe we could go back to also looking at page counts?

Hmm, possibly could.


> > BTW. have you looked at my approach yet? I've tried to solve the fork
> > vs gup race in yet another way. Don't know if you think it is palatable.
>
> I really think we should be able to fix this without _anything_ like that
> at all. Just the lock (and some reuse_swap_page() logic changes).

What part of that do you dislike, though? I don't think the lock is a
particularly elegant idea either (shared cacheline, vmsplice, converting
callers).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:02                                   ` Nick Piggin
@ 2009-03-16 18:05                                     ` Nick Piggin
  2009-03-16 18:17                                       ` Linus Torvalds
  2009-03-16 18:14                                     ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 05:02:56 Nick Piggin wrote:
> On Tuesday 17 March 2009 04:42:48 Linus Torvalds wrote:
> > On Tue, 17 Mar 2009, Nick Piggin wrote:

> > > BTW. have you looked at my approach yet? I've tried to solve the fork
> > > vs gup race in yet another way. Don't know if you think it is
> > > palatable.
> >
> > I really think we should be able to fix this without _anything_ like that
> > at all. Just the lock (and some reuse_swap_page() logic changes).
>
> What part of that do you dislike, though?

If you disregard code motion and the extra argument to copy_page_range,
my fix is a couple of dozen lines of change to existing code, plus the
"decow" function (which could probably share a fair bit of code
with do_wp_page).

Do you dislike the added complexity of the code? Or the behaviour
that gets changed?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:02                                   ` Nick Piggin
  2009-03-16 18:05                                     ` Nick Piggin
@ 2009-03-16 18:14                                     ` Linus Torvalds
  2009-03-16 18:29                                       ` Nick Piggin
  2009-03-16 18:37                                       ` Andrea Arcangeli
  1 sibling, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 18:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> What part of that do you dislike, though? I don't think the lock is a
> particularly elegant idea either (shared cacheline, vmsplice, converting
> callers).

All of the absolute *crap* for no good reason.

Did you even look at your patch? It wasn't as ugly as Andrea's, but it was 
ugly enough, and it was buggy. That whole "decow" stuff was too f*cking 
ugly to live.

Couple that with the fact that no real-life user can possibly care, and 
that O_DIRECT is broken to begin with, and I say: "let's fix this with a 
_much_ smaller patch".

You may think that the lock isn't particularly "elegant", but I can only 
say "f*ck that, look at the number of lines of code, and the simplicity".

Your "elegant" argument is total and utter sh*t, in other words. The lock 
approach is tons more elegant, considering that it solves the problem much 
more cleanly, and with _much_ less crap.

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:05                                     ` Nick Piggin
@ 2009-03-16 18:17                                       ` Linus Torvalds
  2009-03-16 18:33                                         ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 18:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> If you disregard code motion and extra argument to copy_page_range,
> my fix is a couple of dozen lines change to existing code, plus the
> "decow" function (which could probably share a fair bit of code
> with do_wp_page).
> 
> Do you dislike the added complexity of the code? Or the behaviour
> that gets changed?

The complexity. That decow thing is shit. So are all the extra flags, for no 
good reason. 

What's your argument against "keep it simple with a single lock, and 
adding basically a single line to reuse_swap_page() to say "don't reuse 
the page if the count is elevated"?

THAT is simple and elegant, and needs none of the complexity.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 17:42                                 ` Linus Torvalds
  2009-03-16 18:02                                   ` Nick Piggin
@ 2009-03-16 18:28                                   ` Andrea Arcangeli
  1 sibling, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-16 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, KOSAKI Motohiro, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Mon, Mar 16, 2009 at 10:42:48AM -0700, Linus Torvalds wrote:
> Maybe we could go back to also looking at page counts?

Hugh just recently reminded me why we switched to mapcount; the
explanation is in c475a8ab625d567eacf5e30ec35d6d8704558062, which
wasn't entirely safe until ab967d86015a19777955370deebc8262d50fed63
was added too. Together they reliably allowed taking over swapcache
pages held by gup, and at the same time allowed the VM to unmap ptes
pointing to swapcache held by GUP.

Yes, it's possible to go back to page counts; then we only have to
reintroduce the 2.6.7 solution that prevented the VM from unmapping
ptes that map pages taken by GUP. Otherwise do_wp_page won't be able
to remap into the pte the same swapcache page that the VM unmapped,
leading to disk corruption with swapping (the 2.4 bug, fixed in 2.4
with a simpler PG_lock local to direct-io that prevented the VM from
unmapping ptes on the page as long as I/O was in progress; PG_lock
was released by the ->end_io async handler from irq, IIRC).

The only problem I can see is that if mapcount and page count can
change freely while the PT lock and rmap locks are taken, comparing
them won't be as reliable as in ksm/fork (in my version of the fix),
where we're guaranteed mapcount is 1 and stays 1 as long as we hold
the PT lock, because pte_write(pte) == true and PageAnon == true (I
also added a BUG_ON to check that mapcount is always 1 when the other
two conditions are true). That makes the ksm/fork fix quite obviously
safe in this regard.

But for the VM to decide not to unmap a pte taken by GUP, we also have
to deal with mapcount > 1 and pte_write(pte) == false and PageAnon ==
true. So if we solve that ordering issue between reading mapcount and
page count, I don't see much of a problem in returning to checking the
page count in the VM code to prevent the pte from being unmapped while
the page is under GUP, and then removing the mapcount-only check from
the do_wp_page swapcache-reuse logic.

If we returned to using page_count instead of mapcount, the first
patch I posted here would then not require any change to take care of
the 'reverse' race (modulo hugetlb) of the child writing to pages that
are being written to disk by the parent; there would be no need to
de-cow in GUP (again modulo hugetlb).

> I really think we should be able to fix this without _anything_ like that 
> at all. Just the lock (and some reuse_swap_page() logic changes).

I don't see why we should introduce mm-wide locks outside GUP
(worrying about the SetPageGUP in gup-fast, when gup-fast would then
instead have to take an mm-wide lock, sounds like a small issue) when
we can be page-granular and lockless. I agree it could be simpler and
less invasive into the gup details to add the logic outside of gup,
but I don't think the result will be superior, given that it'll most
certainly become a heavier-weight lock bouncing across all cpus
calling gup-fast, and it won't make a speed difference for the CPU
whether the atomic lock op executes inside or outside of gup-fast.
OTOH if the argument for an outer mm-wide lock is to keep the code
simpler or more maintainable, that would explain it. I think fixing it
my way is not more complicated than fixing it outside gup, but then I
may well be biased about what looks simpler to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:14                                     ` Linus Torvalds
@ 2009-03-16 18:29                                       ` Nick Piggin
  2009-03-16 19:17                                         ` Linus Torvalds
  2009-03-16 18:37                                       ` Andrea Arcangeli
  1 sibling, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 05:14:59 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > What part of that do you dislike, though? I don't think the lock is a
> > particularly elegant idea either (shared cacheline, vmsplice, converting
> > callers).
>
> All of the absolute *crap* for no good reason.
>
> Did you even look at your patch? It wasn't as ugly as Andrea's, but it was
> ugly enough, and it was buggy. That whole "decow" stuff was too f*cking
> ugly to live.

What's buggy about it? Stupid bugs, or fundamentally broken?


> Couple that with the fact that no real-life user can possibly care, and
> that O_DIRECT is broken to begin with, and I say: "let's fix this with a
> _much_ smaller patch".

If it is based on nobody caring, I would prefer not to add anything at
all to "fix" it? We have MADV_DONTFORK already...


> You may think that the lock isn't particularly "elegant", but I can only
> say "f*ck that, look at the number of lines of code, and the simplicity".
>
> Your "elegant" argument is total and utter sh*t, in other words. The lock
> approach is tons more elegant, considering that it solves the problem much
> more cleanly, and with _much_ less crap.

In my opinion it is not, given that you have to convert callers. If you
say that you only care about fixing O_DIRECT, then yes I would probably
agree the lock is nicer in that case.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:17                                       ` Linus Torvalds
@ 2009-03-16 18:33                                         ` Nick Piggin
  2009-03-16 19:22                                           ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-16 18:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 05:17:02 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > If you disregard code motion and the extra argument to copy_page_range,
> > my fix is a couple of dozen lines of change to existing code, plus the
> > "decow" function (which could probably share a fair bit of code
> > with do_wp_page).
> >
> > Do you dislike the added complexity of the code? Or the behaviour
> > that gets changed?
>
> The complexity. That decow thing is shit.

copying the page on fork instead of write protecting it? The code or
the idea? Code can certainly be improved...


> So are all the extra flags, for no
> good reason.

Which extra flags are you referring to?


> What's your argument against "keep it simple with a single lock, and
> adding basically a single line to reuse_swap_page() to say "don't reuse
> the page if the count is elevated"?

I made them in a previous message. It depends on what callers you want
to convert I guess. I don't think vmsplice takes to the lock approach
very well though.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:14                                     ` Linus Torvalds
  2009-03-16 18:29                                       ` Nick Piggin
@ 2009-03-16 18:37                                       ` Andrea Arcangeli
  1 sibling, 0 replies; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-16 18:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, KOSAKI Motohiro, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Mon, Mar 16, 2009 at 11:14:59AM -0700, Linus Torvalds wrote:
> You may think that the lock isn't particularly "elegant", but I can only 
> say "f*ck that, look at the number of lines of code, and the simplicity".

I'm sorry, but the number of lines you're reading in the
direct_io_worker patch isn't representative of what it takes to fix it
with an mm-wide lock. It may be conceptually simpler to fix it outside
GUP, and on that I can certainly agree (with the downside of leaving
splice broken etc.), but I can't see how that small patch can fix
anything when it releases the semaphore as soon as direct_io_worker
returns, with O_DIRECT mixed with async-io. Before claiming that the
outer lock results in fewer lines of code, I'd wait to see a fix that
works with O_DIRECT+async-io too, as well as mine and Nick's do.

> Your "elegant" argument is total and utter sh*t, in other words. The lock 
> approach is tons more elegant, considering that it solves the problem much 
> more cleanly, and with _much_ less crap.

I guess elegance is relative, but the size argument is objective, and
that should be possible to compare if somebody writes a full fix that
doesn't fall apart when the return value of direct_io_worker is
-EIOCBQUEUED.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:29                                       ` Nick Piggin
@ 2009-03-16 19:17                                         ` Linus Torvalds
  2009-03-17  5:42                                           ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 19:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> What's buggy about it? Stupid bugs, or fundamentally broken?

The lack of locking.

> In my opinion it is not, given that you have to convert callers. If you
> say that you only care about fixing O_DIRECT, then yes I would probably
> agree the lock is nicer in that case.

F*ck me, I'm not going to bother to argue. I'm not going to merge your 
patch, it's that easy.

Quite frankly, I don't think that the "bug" is a bug to begin with. 
O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we 
fix it the _clean_ way with a simple patch, not with that shit-for-logic 
horrible decow crap.

It's that simple. I refuse to take putrid industrial waste patches for 
something like this.

			Linus


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 18:33                                         ` Nick Piggin
@ 2009-03-16 19:22                                           ` Linus Torvalds
  2009-03-17  5:44                                             ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-16 19:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Nick Piggin wrote:
> 
> > So are all the extra flags, for no
> > good reason.
> 
> Which extra flags are you referring to?

Fuck me, didn't you even read your own patch?

What do you call PG_dontcow? 

		Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:50                           ` Nick Piggin
  2009-03-16 17:02                             ` Linus Torvalds
@ 2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-16 23:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, KOSAKI Motohiro, Benjamin Herrenschmidt,
	Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	linux-mm

On Tue, 17 Mar 2009 03:50:12 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Tuesday 17 March 2009 03:32:11 Linus Torvalds wrote:
> > On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > > Yes, my patch isn't really a solution.
> > > > Andrea already pointed out that it's not an O_DIRECT issue, it's a gup
> > > > vs fork issue. *and* my patch is crazy slow :)
> > >
> > > Well, it's an interesting question. I'd say it probably is more than
> > > just O_DIRECT. vmsplice too, for example (which I think is much harder
> > > to fix this way because the pages are retired by the other end of
> > > the pipe, so I don't think you can hold a lock across it).
> >
> > Well, only the "fork()" has the race problem.
> >
> > So having a fork-specific lock (but not naming it by directio) actually
> > does make sense. The fork is much less performance-critical than most
> > random mmap_sem users - and doesn't have the same scalability issues
> > either (ie people probably _do_ want to do mmap/munmap/brk concurrently
> > with gup lookup, but there's much less worry about concurrent fork()
> > performance).
> >
> > It doesn't necessarily make the general problem go away, but it makes the
> > _particular_ race between get_user_pages() and fork() go away. Then you
> > can do per-page flags or whatever and not have to worry about concurrent
> > lookups.
> 
> Hmm, I see what you mean there; it can be used to solve Andrea's race
> instead of using set_bit/memory barriers. But I think then you would
> still need to put this lock in fork and get_user_pages[_fast], *and*
> still do most of the other stuff required in Andrea's patch.
> 
> So I'm not sure if that was KAMEZAWA-san's patch.
> 
Just FYI.

This was the last patch I sent to Red Hat (against RHEL5), but it was
ignored ;) Please ignore the dirty parts, which come from the
limitation that I can't modify mm_struct.

===
This patch provides a kind of rwlock for DIO.

This patch adds the following:
	struct mm_private {
		struct mm_struct
		... our new data ...
	}

  Before issuing DIO, the DIO submitter should call dio_lock()/dio_unlock().
  Before starting COW, the kernel should call mm_cow_start()/mm_cow_end().

  dio_lock() registers a range of addresses which is under DIO.
  mm_cow_start() checks whether a range of addresses is under DIO, then:
  - If under DIO, retry the fault (to release the rwsem).
  - If not under DIO, mark "we're under COW". This will make DIO submitters
    wait.

  To avoid too many page faults, a "conflict" counter is added; if
  conflict==1, the DIO submitter will wait for a while.

  If no one has issued DIO yet at copy-on-write time, no checks are done.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
--
 fs/direct-io.c             |   43 ++++++++++++++-
 include/linux/direct-io.h  |   38 +++++++++++++
 include/linux/mm_private.h |   24 ++++++++
 kernel/fork.c              |   23 ++++++--
 mm/Makefile                |    2 
 mm/diolock.c               |  129 +++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb.c               |   11 +++
 mm/memory.c                |   15 +++++
 8 files changed, 278 insertions(+), 7 deletions(-)

Index: kame-odirect-linux/include/linux/direct-io.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ kame-odirect-linux/include/linux/direct-io.h	2009-01-30 10:12:58.000000000 +0900
@@ -0,0 +1,38 @@
+#ifndef __LINUX_DIRECT_IO_H
+#define __LINUX_DIRECT_IO_H
+
+struct dio_lock_head
+{
+	spinlock_t		lock;		/* A lock for all below */
+	struct list_head	dios;		/* DIOs running now */
+	int			need_dio_check; /* This process used DIO */
+	int			cows;		/* COWs running now */
+	int			conflicts;	/* conflicts between COW and DIOs*/
+	wait_queue_head_t	waitq;		/* A waitq for all stopped DIOs.*/
+};
+
+struct dio_lock_ent
+{
+	struct list_head 	list;		/* Linked list from head->dios */
+	struct mm_struct	*mm;		/* the mm struct this is assgined for */
+	unsigned long		start;		/* start address for a DIO */
+	unsigned long		end;		/* end address for a DIO */
+};
+
+/* called at fork/exit */
+int dio_lock_init(struct dio_lock_head *head);
+void dio_lock_free(struct dio_lock_head *head);
+
+/*
+ * Called by DIO submitter.
+ */
+int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end,
+		struct dio_lock_ent *lock);
+void dio_unlock(struct dio_lock_ent *lock);
+/*
+ * Called by waiters.
+ */
+int mm_cow_start(struct mm_struct *mm, unsigned long start, unsigned long size);
+void mm_cow_end(struct mm_struct *mm);
+
+#endif
Index: kame-odirect-linux/mm/diolock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ kame-odirect-linux/mm/diolock.c	2009-01-30 10:43:11.000000000 +0900
@@ -0,0 +1,129 @@
+#include <linux/mm.h>
+#include <linux/wait.h>
+#include <linux/hash.h>
+#include <linux/mm_private.h>
+
+
+int dio_lock_init(struct dio_lock_head *head)
+{
+	spin_lock_init(&head->lock);
+	head->need_dio_check = 0;
+	head->cows = 0;
+	head->conflicts = 0;
+	INIT_LIST_HEAD(&head->dios);
+	init_waitqueue_head(&head->waitq);
+	return 0;
+}
+
+void dio_lock_free(struct dio_lock_head *head)
+{
+	BUG_ON(!list_empty(&head->dios));
+	return;
+}
+
+
+int dio_lock(struct mm_struct *mm, unsigned long start, unsigned long end,
+	     struct dio_lock_ent *lock)
+{
+	unsigned long flags;
+	struct dio_lock_head *head;
+	DEFINE_WAIT(wait);
+retry:
+	if (signal_pending(current))
+		return -EINTR;
+	head  = &get_mm_private(mm)->diolock;
+
+	if (!head->need_dio_check) {
+		down_write(&mm->mmap_sem);
+		head->need_dio_check = 1;
+		up_write(&mm->mmap_sem);
+	}
+
+	prepare_to_wait(&head->waitq, &wait, TASK_INTERRUPTIBLE);
+	spin_lock_irqsave(&head->lock, flags);
+	if (head->cows || head->conflicts) { /* Allow COWs to go ahead rather than new I/O */
+		spin_unlock_irqrestore(&head->lock, flags);
+		if (head->cows)
+			schedule();
+		else {
+			schedule_timeout(10); /* Allow 10 ticks for COW retry */
+			head->conflicts = 0;
+		}
+		finish_wait(&head->waitq, &wait);
+		goto retry;
+	}
+	lock->mm = mm;
+	lock->start = PAGE_ALIGN(start);
+	lock->end = PAGE_ALIGN(end) + PAGE_SIZE;
+	list_add(&lock->list, &head->dios);
+	atomic_inc(&mm->mm_users);
+	spin_unlock_irqrestore(&head->lock, flags);
+	finish_wait(&head->waitq, &wait);
+	return 0;
+}
+
+void dio_unlock(struct dio_lock_ent *lock)
+{
+	struct dio_lock_head *head;
+	struct mm_struct *mm;
+	unsigned long flags;
+
+	mm = lock->mm;
+	head = &get_mm_private(mm)->diolock;
+	spin_lock_irqsave(&head->lock, flags);
+	list_del(&lock->list);
+	if (waitqueue_active(&head->waitq))
+		wake_up_all(&head->waitq);
+	spin_unlock_irqrestore(&head->lock, flags);
+	mmput(mm);
+}
+
+int mm_cow_start(struct mm_struct *mm,
+		unsigned long start, unsigned long end)
+{
+	struct dio_lock_head *head;
+	struct dio_lock_ent *lock;
+
+	head = &get_mm_private(mm)->diolock;
+	if (!head->need_dio_check)
+		return 0;
+
+	spin_lock_irq(&head->lock);
+	head->cows++;
+	if (list_empty(&head->dios)) {
+		spin_unlock_irq(&head->lock);
+		return 0;
+	}
+	/* SLOW PATH */	
+	list_for_each_entry(lock, &head->dios, list) {
+		if ((start < lock->end) && (end > lock->start)) {
+			head->cows--;
+			head->conflicts++;
+			spin_unlock_irq(&head->lock);
+			 /* This page fault will be retried but new dio requests will be
+			    delayed until cow ends.*/
+			return 1;
+		}
+	}
+	spin_unlock_irq(&head->lock);
+	return 0;
+}
+
+void mm_cow_end(struct mm_struct *mm)
+{
+	struct dio_lock_head *head;
+
+	head = &get_mm_private(mm)->diolock;
+	if (!head->need_dio_check)
+		return;
+
+	spin_lock_irq(&head->lock);
+	head->cows--;
+	if (!head->cows) {
+		head->conflicts = 0;
+		if (waitqueue_active(&head->waitq))
+			wake_up_all(&head->waitq);
+	}
+	spin_unlock_irq(&head->lock);
+	
+}
Index: kame-odirect-linux/fs/direct-io.c
===================================================================
--- kame-odirect-linux.orig/fs/direct-io.c	2009-01-29 14:01:44.000000000 +0900
+++ kame-odirect-linux/fs/direct-io.c	2009-01-30 10:53:45.000000000 +0900
@@ -34,6 +34,8 @@
 #include <linux/buffer_head.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
+#include <linux/direct-io.h>
+
 #include <asm/atomic.h>
 
 /*
@@ -130,8 +132,43 @@
 	int is_async;			/* is IO async ? */
 	int io_error;			/* IO error in completion path */
 	ssize_t result;                 /* IO result */
+
+	/* For sanity of Direct-IO and Copy-On-Write */
+	struct dio_lock_ent		*locks;
+	int				nr_segs;
 };
 
+int dio_protect_all(struct dio *dio, const struct iovec *iov, int nsegs)
+{
+	struct dio_lock_ent *lock;
+	unsigned long start, end;
+	int seg;
+
+	lock = kzalloc(sizeof(*lock) * nsegs, GFP_KERNEL);
+	if (!lock)
+		return -ENOMEM;
+	dio->locks = lock;
+	dio->nr_segs = nsegs;
+	for (seg = 0; seg < nsegs; seg++) {
+		start = (unsigned long)iov[seg].iov_base;
+		end = (unsigned long)iov[seg].iov_base + iov[seg].iov_len;
+		dio_lock(current->mm, start, end, lock+seg);
+	}
+	return 0;
+}
+
+void dio_release_all_protection(struct dio *dio)
+{
+	int seg;
+
+	if (!dio->locks)
+		return;
+
+	for (seg = 0; seg < dio->nr_segs; seg++)
+		dio_unlock(dio->locks + seg);
+	kfree(dio->locks);
+}
+
 /*
  * How many pages are in the queue?
  */
@@ -284,6 +321,7 @@
 	if (remaining == 0) {
 		int ret = dio_complete(dio, dio->iocb->ki_pos, 0);
 		aio_complete(dio->iocb, ret, 0);
+		dio_release_all_protection(dio);
 		kfree(dio);
 	}
 
@@ -965,6 +1003,7 @@
 
 	dio->iocb = iocb;
 	dio->i_size = i_size_read(inode);
+	dio->locks = NULL;
 
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
@@ -1088,6 +1127,7 @@
 
 	if (ret2 == 0) {
 		ret = dio_complete(dio, offset, ret);
+		dio_release_all_protection(dio);
 		kfree(dio);
 	} else
 		BUG_ON(ret != -EIOCBQUEUED);
@@ -1166,7 +1206,8 @@
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
-
+	if (dio_protect_all(dio, iov, nr_segs))
+		goto out;
 	/*
 	 * For block device access DIO_NO_LOCKING is used,
 	 *	neither readers nor writers do any locking at all
Index: kame-odirect-linux/kernel/fork.c
===================================================================
--- kame-odirect-linux.orig/kernel/fork.c	2009-01-29 14:01:44.000000000 +0900
+++ kame-odirect-linux/kernel/fork.c	2009-01-30 09:54:05.000000000 +0900
@@ -46,6 +46,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/hash.h>
+#include <linux/mm_private.h>
 #ifndef __GENKSYMS__
 #include <linux/ptrace.h>
 #include <linux/tty.h>
@@ -77,8 +78,8 @@
 struct hlist_head mm_flags_hash[MM_FLAGS_HASH_SIZE] =
 	{ [ 0 ... MM_FLAGS_HASH_SIZE - 1 ] = HLIST_HEAD_INIT };
 DEFINE_SPINLOCK(mm_flags_lock);
-#define MM_HASH_SHIFT ((sizeof(struct mm_struct) >= 1024) ? 10	\
-		       : (sizeof(struct mm_struct) >= 512) ? 9	\
+#define MM_HASH_SHIFT ((sizeof(struct mm_private) >= 1024) ? 10	\
+		       : (sizeof(struct mm_private) >= 512) ? 9	\
 		       : 8)
 #define mm_flags_hash_fn(mm) \
 	hash_long((unsigned long)(mm) >> MM_HASH_SHIFT, MM_FLAGS_HASH_BITS)
@@ -299,6 +300,17 @@
 	spin_unlock(&mm_flags_lock);
 }
 
+static void init_mm_private(struct mm_private *mmp)
+{
+	dio_lock_init(&mmp->diolock);
+}
+
+static void free_mm_private(struct mm_private *mmp)
+{
+	dio_lock_free(&mmp->diolock);
+}
+
+
 #ifdef CONFIG_MMU
 static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -430,7 +442,7 @@
  __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, SLAB_KERNEL))
-#define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
+#define free_mm(mm)	(kmem_cache_free(mm_cachep, get_mm_private((mm))))
 
 #include <linux/init_task.h>
 
@@ -451,6 +463,7 @@
 	mm->ioctx_list = NULL;
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
+	init_mm_private(get_mm_private(mm));
 
 	mm_flags = get_mm_flags(current->mm);
 	if (mm_flags != MMF_DUMP_FILTER_DEFAULT) {
@@ -466,6 +479,7 @@
 	if (mm_flags != MMF_DUMP_FILTER_DEFAULT)
 		free_mm_flags(mm);
 fail_nomem:
+	free_mm_private(get_mm_private(mm));
 	free_mm(mm);
 	return NULL;
 }
@@ -494,6 +508,7 @@
 {
 	BUG_ON(mm == &init_mm);
 	free_mm_flags(mm);
+	free_mm_private(get_mm_private(mm));
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	free_mm(mm);
@@ -1550,7 +1565,7 @@
 			sizeof(struct vm_area_struct), 0,
 			SLAB_PANIC, NULL, NULL);
 	mm_cachep = kmem_cache_create("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			sizeof(struct mm_private), ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 }
 
Index: kame-odirect-linux/mm/Makefile
===================================================================
--- kame-odirect-linux.orig/mm/Makefile	2009-01-29 14:01:44.000000000 +0900
+++ kame-odirect-linux/mm/Makefile	2009-01-29 14:01:59.000000000 +0900
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o
+			   vmalloc.o diolock.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
Index: kame-odirect-linux/mm/memory.c
===================================================================
--- kame-odirect-linux.orig/mm/memory.c	2009-01-29 14:01:44.000000000 +0900
+++ kame-odirect-linux/mm/memory.c	2009-01-29 16:18:19.000000000 +0900
@@ -50,6 +50,7 @@
 #include <linux/delayacct.h>
 #include <linux/init.h>
 #include <linux/writeback.h>
+#include <linux/direct-io.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1665,6 +1666,7 @@
 	int reuse = 0, ret = VM_FAULT_MINOR;
 	struct page *dirty_page = NULL;
 	int dirty_pte = 0;
+	int dio_stop = 0;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page)
@@ -1738,6 +1740,7 @@
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
+
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	if (old_page == ZERO_PAGE(address)) {
@@ -1748,6 +1751,11 @@
 		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
 		if (!new_page)
 			goto oom;
+		if (mm_cow_start(mm, address, address+PAGE_SIZE)) {
+			page_cache_release(new_page);
+			goto out_retry;
+		}
+		dio_stop = 1;
 		cow_user_page(new_page, old_page, address);
 	}
 
@@ -1789,6 +1797,9 @@
 		page_cache_release(new_page);
 	if (old_page)
 		page_cache_release(old_page);
+	/* Allow DIO progress */
+	if (dio_stop)
+		mm_cow_end(mm);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 	if (dirty_page) {
@@ -1797,6 +1808,10 @@
 		put_page(dirty_page);
 	}
 	return ret;
+out_retry:
+	if (old_page)
+		page_cache_release(old_page);
+	return ret;
 oom:
 	if (old_page)
 		page_cache_release(old_page);
Index: kame-odirect-linux/mm/hugetlb.c
===================================================================
--- kame-odirect-linux.orig/mm/hugetlb.c	2009-01-29 14:01:44.000000000 +0900
+++ kame-odirect-linux/mm/hugetlb.c	2009-01-29 16:29:51.000000000 +0900
@@ -14,6 +14,7 @@
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
+#include <linux/direct-io.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -470,7 +471,13 @@
 		page_cache_release(old_page);
 		return VM_FAULT_OOM;
 	}
-
+	if (mm_cow_start(mm, address & HPAGE_MASK, HPAGE_SIZE)) {
+		/* we have to retry. */
+		page_cache_release(old_page);
+		page_cache_release(new_page);
+		return VM_FAULT_MINOR;
+	}
+	
 	spin_unlock(&mm->page_table_lock);
 	copy_huge_page(new_page, old_page, address);
 	spin_lock(&mm->page_table_lock);
@@ -486,6 +493,8 @@
 	}
 	page_cache_release(new_page);
 	page_cache_release(old_page);
+	mm_cow_end(mm);
+
 	return VM_FAULT_MINOR;
 }
 
Index: kame-odirect-linux/include/linux/mm_private.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ kame-odirect-linux/include/linux/mm_private.h	2009-01-30 09:52:26.000000000 +0900
@@ -0,0 +1,24 @@
+#ifndef __LINUX_MM_PRIVATE_H
+#define __LINUX_MM_PRIVATE_H
+
+#include <linux/sched.h>
+#include <linux/direct-io.h>
+
+/*
+ * Because we have to keep KABI, we cannot modify mm_struct itself. This
+ * mm_private is a per-process object and is not covered by KABI.
+ * It just provides fields for future bugfixes.
+ * Note: for now, this is not copied at fork().
+ */
+struct mm_private {
+	struct mm_struct	mm;
+	/* For fixing direct-io/COW races. */
+	struct dio_lock_head	diolock;
+};
+
+static inline struct mm_private *get_mm_private(struct mm_struct *mm)
+{
+	return container_of(mm, struct mm_private, mm);
+}
+
+#endif





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:01                     ` KOSAKI Motohiro
  2009-03-16 16:23                       ` Nick Piggin
@ 2009-03-17  0:44                       ` Linus Torvalds
  2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
  2009-03-17 12:19                         ` Andrea Arcangeli
  1 sibling, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17  0:44 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, KOSAKI Motohiro wrote:
> 
> if we only need to be concerned with O_DIRECT, the patch below is enough.

.. together with something like this, to handle the other direction. This 
should take care of the case of an O_DIRECT write() call using a page that 
was duplicated by an _earlier_ fork(), and then got split up by a COW in
the wrong direction (ie having data from the child show up in the write).

Untested. But fairly trivial, after all. We simply do the same old 
"reuse_swap_page()" count, but we only break the COW if the page count 
afterwards is 1 (reuse_swap_page will have removed it from the swap cache 
if it returns success).

Does this (together with Kosaki's patch) pass the tests that Andrea had?

		Linus

---
 mm/memory.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index baa999e..2bd5fb0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			page_cache_release(old_page);
 		}
-		reuse = reuse_swap_page(old_page);
+		/*
+		 * If we can re-use the swap page _and_ the end
+		 * result has only one user (the mapping), then
+		 * we reuse the whole page
+		 */
+		if (reuse_swap_page(old_page))
+			reuse = page_count(old_page) == 1;
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17  0:44                       ` Linus Torvalds
@ 2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
  2009-03-17 12:19                         ` Andrea Arcangeli
  1 sibling, 0 replies; 83+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-17  0:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	linux-mm

On Mon, 16 Mar 2009 17:44:25 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Tue, 17 Mar 2009, KOSAKI Motohiro wrote:
> > 
> > if we only need to be concerned with O_DIRECT, the patch below is enough.
> 
> .. together with something like this, to handle the other direction. This 
> should take care of the case of an O_DIRECT write() call using a page that 
> was duplicated by an _earlier_ fork(), and then got split up by a COW in
> the wrong direction (ie having data from the child show up in the write).
> 
> Untested. But fairly trivial, after all. We simply do the same old 
> "reuse_swap_page()" count, but we only break the COW if the page count 
> afterwards is 1 (reuse_swap_page will have removed it from the swap cache 
> if it returns success).
> 
> Does this (together with Kosaki's patch) pass the tests that Andrea had?
> 
I'm not sure, but I have doubts about the "AIO" case.

+	down_read(&current->mm->directio_sem);
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
 				nr_segs, blkbits, get_block, end_io, dio);
+	up_read(&current->mm->directio_sem);

With AIO, holding the semaphore over just this range doesn't seem to be enough.
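
(A hedged sketch of what the AIO case would need instead: the read side
has to be dropped from the dio completion path, which may run in a
different task or even from irq context -- hence the _non_owner variant.
The field and function names are hypothetical, and real code could not
call mmput() from irq context without deferring it:)

	/* submission side: pin the mm and take the read side */
	down_read_non_owner(&current->mm->directio_sem);
	atomic_inc(&current->mm->mm_users);
	dio->mm = current->mm;

	/* completion side, covering the -EIOCBQUEUED/async case too */
	static void dio_drop_fork_protection(struct dio *dio)
	{
		up_read_non_owner(&dio->mm->directio_sem);
		mmput(dio->mm);
	}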

Thanks,
-Kame



> 		Linus
> 
> ---
>  mm/memory.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index baa999e..2bd5fb0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1928,7 +1928,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			}
>  			page_cache_release(old_page);
>  		}
> -		reuse = reuse_swap_page(old_page);
> +		/*
> +		 * If we can re-use the swap page _and_ the end
> +		 * result has only one user (the mapping), then
> +		 * we reuse the whole page
> +		 */
> +		if (reuse_swap_page(old_page))
> +			reuse = page_count(old_page) == 1;
>  		unlock_page(old_page);
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 19:17                                         ` Linus Torvalds
@ 2009-03-17  5:42                                           ` Nick Piggin
  2009-03-17  5:58                                             ` Nick Piggin
  0 siblings, 1 reply; 83+ messages in thread
From: Nick Piggin @ 2009-03-17  5:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 06:17:21 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > What's buggy about it? Stupid bugs, or fundamentally broken?
>
> The lack of locking.

I don't think it's broken. I can't see a problem.


> > In my opinion it is not, given that you have to convert callers. If you
> > say that you only care about fixing O_DIRECT, then yes I would probably
> > agree the lock is nicer in that case.
>
> F*ck me, I'm not going to bother to argue. I'm not going to merge your
> patch, it's that easy.
>
> Quite frankly, I don't think that the "bug" is a bug to begin with.
> O_DIRECT+fork() can damn well continue to be broken. But if we fix it, we
> fix it the _clean_ way with a simple patch, not with that shit-for-logic
> horrible decow crap.
>
> It's that simple. I refuse to take putrid industrial waste patches for
> something like this.

I consider it clean because it only adds branches in 3 places that
are not taken unless direct IO and fork are used, and it fixes the
"problem" in the VM directly, leaving get_user_pages unchanged.

I don't think it is conceptually such a problem to copy pages rather
than COW them in fork. Seems fairly straightforward to me.
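
(A hedged sketch of that "copy instead of COW at fork" idea, using
hypothetical helpers -- a reconstruction of the concept, not Nick's
actual patch; gup_pinned() and set_child_pte() are made-up names:)

	/*
	 * In the fork page-copy loop: if the parent's anon page may be
	 * pinned by get_user_pages(), give the child its own copy right
	 * away instead of write-protecting both ptes.
	 */
	if (is_cow_mapping(vma->vm_flags) && PageAnon(page) &&
	    gup_pinned(page)) {
		struct page *new = alloc_page_vma(GFP_HIGHUSER, vma, addr);

		if (!new)
			return -ENOMEM;
		copy_user_highpage(new, page, addr, vma);
		/*
		 * Map the copy in the child; the parent's pte stays
		 * writable, so in-flight DMA stays coherent with the
		 * page the parent (and gup) still sees.
		 */
		set_child_pte(dst_mm, dst_pte, new, vma);
		return 0;
	}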

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 19:22                                           ` Linus Torvalds
@ 2009-03-17  5:44                                             ` Nick Piggin
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-17  5:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 06:22:12 Linus Torvalds wrote:
> On Tue, 17 Mar 2009, Nick Piggin wrote:
> > > So are all the extra flags, for no
> > > good reason.
> >
> > Which extra flags are you referring to?
>
> Fuck me, didn't you even read your own patch?
>
> What do you call PG_dontcow?

It is a flag, there for a good reason.

It sounded like you were seeing more than one flag, and that
you thought they were useless.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17  5:42                                           ` Nick Piggin
@ 2009-03-17  5:58                                             ` Nick Piggin
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-17  5:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Benjamin Herrenschmidt, Andrea Arcangeli,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tuesday 17 March 2009 16:42:24 Nick Piggin wrote:

> I consider it clean because it only adds branches in 3 places that
> are not taken unless direct IO and fork are used, and it fixes the
> "problem" in the VM directly, leaving get_user_pages unchanged.

leaving get_user_pages callers unchanged.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17  0:44                       ` Linus Torvalds
  2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
@ 2009-03-17 12:19                         ` Andrea Arcangeli
  2009-03-17 16:43                           ` Linus Torvalds
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-17 12:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Mon, Mar 16, 2009 at 05:44:25PM -0700, Linus Torvalds wrote:
> -		reuse = reuse_swap_page(old_page);
> +		/*
> +		 * If we can re-use the swap page _and_ the end
> +		 * result has only one user (the mapping), then
> +		 * we reuse the whole page
> +		 */
> +		if (reuse_swap_page(old_page))
> +			reuse = page_count(old_page) == 1;
>  		unlock_page(old_page);

Think of the case where the anon page is added to swapcache and the
pte is unmapped by the VM and set non-present after GUP has taken the
page for an O_DIRECT read (a write to memory). If a thread writes to
the page while the O_DIRECT read is running in another thread (or
aio), then do_wp_page will make a copy of the swapcache page under the
O_DIRECT read, and part of the read operation will get lost.
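
(Spelling that interleaving out as a timeline -- a reconstruction of the
scenario just described, not taken from the original mail:)

	/*
	 * thread A (gup + O_DIRECT)      VM / thread B
	 * -------------------------      -------------
	 * gup() pins anon page P for
	 *   an O_DIRECT read; the
	 *   device will DMA into P
	 *                                P added to swapcache; pte
	 *                                unmapped, made non-present
	 *                                B touches the vaddr: swapin
	 *                                maps P read-only, a write
	 *                                then faults and do_wp_page()
	 *                                copies P into P'
	 * DMA completes into P, but
	 *   the task now maps P' --
	 *   part of the read is lost
	 */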

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 12:19                         ` Andrea Arcangeli
@ 2009-03-17 16:43                           ` Linus Torvalds
  2009-03-17 17:01                             ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 16:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Andrea Arcangeli wrote:
> 
> Think of the case where the anon page is added to swapcache and the
> pte is unmapped by the VM and set non-present after GUP has taken the
> page for an O_DIRECT read (a write to memory). If a thread writes to
> the page while the O_DIRECT read is running in another thread (or
> aio), then do_wp_page will make a copy of the swapcache page under the
> O_DIRECT read, and part of the read operation will get lost.

In that case, you aren't getting to the "do_wp_page()" case at all, you're 
getting the "do_swap_page()" case. Which does its own reuse_swap_page() 
thing (and that one I didn't touch - on purpose).

But you're right - it only does that for writes. If we _first_ do a read 
(to swap it back in), it will mark it read-only and _then_ we can get a 
"do_wp_page()" that splits it.

So yes - I had expected our VM to be sane, and have a writable private 
page _stay_ writable (in the absence of fork() it should never turn into a 
COW page), but the swapout+swapin code can result in a rw page that turns 
read-only in order to catch a swap cache invalidation.

Good catch. Let me think about it.

			Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 16:43                           ` Linus Torvalds
@ 2009-03-17 17:01                             ` Linus Torvalds
  2009-03-17 17:10                               ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 17:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> So yes - I had expected our VM to be sane, and have a writable private 
> page _stay_ writable (in the absence of fork() it should never turn into a 
> COW page), but the swapout+swapin code can result in a rw page that turns 
> read-only in order to catch a swap cache invalidation.
> 
> Good catch. Let me think about it.

Btw, I think this is actually a pre-existing bug regardless of my patch.

That same swapout+swapin problem seems to lose the dirty bit on an O_DIRECT 
write - exactly for the same reason. When swapin turns the page into a 
read-only page in order to keep the physical page in the swap cache, the 
write to the physical page (that was gotten by get_user_pages() earlier) 
will bypass all that.

So the get_user_pages() users will then write to the page, but the next 
time we swap things out, if nobody _else_ wrote to it, that write will be 
lost because we'll just drop the page (it was in the swap cache!) even 
though it had changed data on it.

My patch changed the scenario a bit (split page rather than dropped 
page), but the fundamental cause seems to be the same - the swap cache 
code very much depends on writes to the _virtual_ address.

Or am I missing something?

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 17:01                             ` Linus Torvalds
@ 2009-03-17 17:10                               ` Andrea Arcangeli
  2009-03-17 17:43                                 ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-17 17:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> That same swapout+swapin problem seems to lose the dirty bit on an O_DIRECT 

I think the dirty bit is set in dio_bio_complete (or
bio_check_pages_dirty for the aio case) so forcing the swapcache to be
written out again before the page can be freed.

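[Editorial note: the completion-side pattern Andrea points at, sketched
as a standalone helper rather than the actual fs/direct-io.c code; the
function name is made up. Any gup user that lets a device write into the
pages has to redirty them when dropping the pin, roughly like this.]

static void release_pinned_pages(struct page **pages, int nr,
				 int device_wrote)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (device_wrote)
			/* the DMA bypassed the ptes; re-mark the page dirty */
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);	/* drop the gup reference */
	}
}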

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 17:10                               ` Andrea Arcangeli
@ 2009-03-17 17:43                                 ` Linus Torvalds
  2009-03-17 18:09                                   ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 17:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Andrea Arcangeli wrote:

> On Tue, Mar 17, 2009 at 10:01:06AM -0700, Linus Torvalds wrote:
> > That same swapout+swapin problem seems to lose the dirty bit on an O_DIRECT 
> 
> I think the dirty bit is set in dio_bio_complete (or
> bio_check_pages_dirty for the aio case) so forcing the swapcache to be
> written out again before the page can be freed.

Do all the other get_user_pages() users do that, though?

[ Looks around - at least access_process_vm(), IB and the NFS direct code 
  do. So we seem to be mostly ok, at least for the main users ]

Ok, no worries.

		Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 17:43                                 ` Linus Torvalds
@ 2009-03-17 18:09                                   ` Linus Torvalds
  2009-03-17 18:19                                     ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 18:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> Do all the other get_user_pages() users do that, though?
> 
> [ Looks around - at least access_process_vm(), IB and the NFS direct code 
>   do. So we seem to be mostly ok, at least for the main users ]
> 
> Ok, no worries.

This problem is actually pretty easy to fix for anonymous pages: since the 
act of pinning (for writes) should have done all the COW stuff and made 
sure the page is not in the swap cache, we only need to avoid adding it 
back.

IOW, something like the following makes sense on all levels regardless 
(note: I didn't check if there is some off-by-one issue where we've raised 
the page count for other reasons when scanning it, so this is not meant to 
be a serious patch, just a "something along these lines" thing).

This does not obviate the need to mark pages dirty afterwards, though, 
since true shared mappings always cause that (and we cannot keep them 
dirty, since somebody may be doing fsync() on them or something like 
that).

But since the COW issue is only a matter of private pages, this handles 
that trivially.

			Linus

---
 mm/swap_state.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..83137fe 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -140,6 +140,10 @@ int add_to_swap(struct page *page)
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageUptodate(page));
 
+	/* Refuse to add pinned pages to the swap cache */
+	if (page_count(page) > page_mapped(page))
+		return 0;
+
 	for (;;) {
 		entry = get_swap_page();
 		if (!entry.val)

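[Editorial note: the check from the patch above, restated as a helper
for discussion; the name is made up. Note that page_mapped() returns
0/1, so for a page mapped into several mms the comparison trips even
without a gup pin - the page_mapcount() point Andrea raises below.]

static inline int page_presumed_pinned(struct page *page)
{
	/*
	 * Every pte mapping contributes one reference to page_count(),
	 * so a count above the mapping count suggests an extra (gup)
	 * reference is outstanding.
	 */
	return page_count(page) > page_mapped(page);
}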

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 18:09                                   ` Linus Torvalds
@ 2009-03-17 18:19                                     ` Linus Torvalds
  2009-03-17 18:46                                       ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 18:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Linus Torvalds wrote:
> 
> This problem is actually pretty easy to fix for anonymous pages: since the 
> act of pinning (for writes) should have done all the COW stuff and made 
> sure the page is not in the swap cache, we only need to avoid adding it 
> back.

An alternative approach would have been to just count page pinning as 
being a "referenced", which to some degree would be even more logical (we 
don't set the referenced flag when we look those pages up). That would 
also affect pages that were get_user_page'd just for reading, which might 
be seen as an additional bonus.

The "don't turn pinned pages into swap cache pages" is a somewhat more 
direct patch, though. It gives more obvious guarantees about the lifetime 
behaviour of anon pages wrt get_user_pages[_fast]().. 

		Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 18:19                                     ` Linus Torvalds
@ 2009-03-17 18:46                                       ` Andrea Arcangeli
  2009-03-17 19:03                                         ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-17 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tue, Mar 17, 2009 at 11:19:59AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 17 Mar 2009, Linus Torvalds wrote:
> > 
> > This problem is actually pretty easy to fix for anonymous pages: since the 
> > act of pinning (for writes) should have done all the COW stuff and made 
> > sure the page is not in the swap cache, we only need to avoid adding it 
> > back.
> 
> An alternative approach would have been to just count page pinning as 
> being a "referenced", which to some degree would be even more logical (we 
> don't set the referenced flag when we look those pages up). That would 
> also affect pages that were get_user_page'd just for reading, which might 
> be seen as an additional bonus.
> 
> The "don't turn pinned pages into swap cache pages" is a somewhat more 
> direct patch, though. It gives more obvious guarantees about the lifetime 
> behaviour of anon pages wrt get_user_pages[_fast]().. 

I don't think you can tackle this from add_to_swap because the page
may be in the swapcache well before gup runs (gup(write=1) can map the
swapcache as exclusive and read-write in the pte). So then what
happens is again that the VM unmaps the page, do_swap_page maps it as
readonly swapcache (so far so good), and then do_wp_page copies the
page under the O_DIRECT read again.

The off-by-one is almost certain, as it's invoked by the VM, but that's
an implementation detail not relevant for this discussion, agreed. And I
guess you also meant page_mapcount instead of page_mapped, or I think
shared pages would stop being swapped out. That is more relevant
because of some worry I have about the comparison between page count and
mapcount, see below.

My preference is still to keep pages with elevated refcount pinned in
the ptes like 2.6.7 did; that will allow do_wp_page to take over only
pages with page_count not elevated, without risk of calling do_wp_page
on any page under gup. The only worry I have now is how to compare count
with mapcount when both can change under us if mapcount > 1, but if
you meant page_mapcount in add_to_swap, as I think, that logic in
add_to_swap would have the same problem and so it needs a solution for
doing a coherent/safe comparison too.


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 18:46                                       ` Andrea Arcangeli
@ 2009-03-17 19:03                                         ` Linus Torvalds
  2009-03-17 19:35                                           ` Andrea Arcangeli
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 19:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Andrea Arcangeli wrote:
> 
> I don't think you can tackle this from add_to_swap because the page
> may be in the swapcache well before gup runs (gup(write=1) can map the
> swapcache as exclusive and read-write in the pte).

If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
will do the COW break and un-swapcache it.

When can it be writably in the swap cache? The read-only thing is the one 
we use to invalidate stale swap cache entries, and when we mark those 
pages writable (in do_wp_page or do_swap_page) we always remove the page 
from the swap cache at the same time.

Or is there some other path I missed?

> My preference is still to keep pages with elevated refcount pinned in
> the ptes like 2.6.7 did; that will allow do_wp_page to take over only
> pages with page_count not elevated, without risk of calling do_wp_page
> on any page under gup.

I agree that that would also work - and be even simpler. If done right, we 
can even avoid clearing the dirty bit (in page_mkclean()) for such pages, 
and now it works for _all_ pages, not just anonymous pages.

IOW, even if you had a shared mapping and were to GUP() those pages for 
writing, they'd _stay_ dirty until you free'd them - no need to re-dirty 
them in case somebody did IO on them. 

> The only worry I have now is how to compare count
> with mapcount when both can change under us if mapcount > 1, but if
> you meant page_mapcount in add_to_swap, as I think, that logic in
> add_to_swap would have the same problem and so it needs a solution for
> doing a coherent/safe comparison too.

I don't think you can use just mapcount on its own - you have to compare 
it to page_count(). Otherwise perfectly normal (non-gup) pages will 
trigger, since that page count is the only thing that differs between the 
two cases.

			Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 19:03                                         ` Linus Torvalds
@ 2009-03-17 19:35                                           ` Andrea Arcangeli
  2009-03-17 19:55                                             ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Arcangeli @ 2009-03-17 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote:
> If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
> will do the COW break and un-swapcache it.

It may turn it read-write instead of COW break and un-swapcache.

   if (write_access && reuse_swap_page(page)) {
      pte = maybe_mkwrite(pte_mkdirty(pte), vma);

This is done to avoid fragmenting the swap device.

> I agree that that would also work - and be even simpler. If done right, we 
> can even avoid clearing the dirty bit (in page_mkclean()) for such pages, 
> and now it works for _all_ pages, not just anonymous pages.
> 
> IOW, even if you had a shared mapping and were to GUP() those pages for 
> writing, they'd _stay_ dirty until you free'd them - no need to re-dirty 
> them in case somebody did IO on them. 

I agree in principle, if the VM stays away from pages under GUP
theoretically the dirty bit shouldn't be transferred to the PG_dirty
of the page until after the I/O is complete, so the dirty bit set by
gup in the pte may be enough. Not sure if there are other places that
could transfer the dirty bit of the pte before the gup user releases
the page-pin.

> I don't think you can use just mapcount on its own - you have to compare 
> it to page_count(). Otherwise perfectly normal (non-gup) pages will 
> trigger, since that page count is the only thing that differs between the 
> two cases.

Yes, page_count shall be compared with page_mapcount. My worry is only
that both can change from under us if mapcount > 1 (holding the PT lock
is not enough to be sure mapcount/count are stable if mapcount > 1).


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-17 19:35                                           ` Andrea Arcangeli
@ 2009-03-17 19:55                                             ` Linus Torvalds
  0 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-17 19:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Nick Piggin, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Tue, 17 Mar 2009, Andrea Arcangeli wrote:

> On Tue, Mar 17, 2009 at 12:03:55PM -0700, Linus Torvalds wrote:
> > If it's in the swap cache, it should be mapped read-only, and gup(write=1) 
> > will do the COW break and un-swapcache it.
> 
> It may turn it read-write instead of COW break and un-swapcache.
> 
>    if (write_access && reuse_swap_page(page)) {
>       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> 
> This is done to avoid fragmenting the swap device.

Right, but reuse_swap_page() will have removed it from the swapcache if it 
returns success.

So if the page is writable in the page tables, it should not be in the 
swap cache.

Oh, except that we do it in shrink_page_list(), and while we're going to 
do that whole "try_to_unmap()", I guess it can fail to unmap there? In 
that case, you could actually have it in the page tables while in the swap 
cache.

And besides, we do remove it from the page tables in the wrong order (ie 
we add it to the swap cache first, _then_ remove it), so I guess that also 
ends up being a race with another CPU doing fast-gup. And we _have_ to do 
it in that order at least for the map_count > 1 case, since a read-only 
swap page may be shared by multiple mm's, and the swap-cache is how we 
make sure that they all end up joining together.

Of course, the only case we really care about is the map_count=1 case, 
since that's the only one that is possible after GUP has succeeded 
(assuming, as always, that fork() is locked out of making copies). So we 
really only care about the simpler case.

> I agree in principle, if the VM stays away from pages under GUP
> theoretically the dirty bit shouldn't be transferred to the PG_dirty
> of the page until after the I/O is complete, so the dirty bit set by
> gup in the pte may be enough. Not sure if there are other places that
> could transfer the dirty bit of the pte before the gup user releases
> the page-pin.

I do suspect there are subtle issues like the above. 

> > I don't think you can use just mapcount on its own - you have to compare 
> > it to page_count(). Otherwise perfectly normal (non-gup) pages will 
> > trigger, since that page count is the only thing that differs between the 
> > two cases.
> 
> Yes, page_count shall be compared with page_mapcount. My worry is only
> that both can change from under us if mapcount > 1 (holding the PT lock
> is not enough to be sure mapcount/count are stable if mapcount > 1).

Now, that's not a big worry, because we only care about mapcount=1 for the 
anonymous page case at least. So we can stabilize that one with the pt 
lock.

			Linus

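[Editorial note: a sketch of the "stabilize it with the pt lock" idea
that closes this subthread. The function name is made up, and the
constant on the last line is the fragile part - swap cache and LRU
isolation can hold transient references, which is exactly the
off-by-one caveat discussed earlier.]

static int anon_page_is_gup_pinned(struct page *page)
{
	/*
	 * Caller holds the pte lock of the page's sole mapping, so
	 * page_mapcount() cannot change under us.
	 */
	VM_BUG_ON(!PageAnon(page));
	VM_BUG_ON(page_mapcount(page) != 1);
	return page_count(page) > 1 + !!PageSwapCache(page);
}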

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-16 16:23                       ` Nick Piggin
  2009-03-16 16:32                         ` Linus Torvalds
@ 2009-03-18  2:04                         ` KOSAKI Motohiro
  2009-03-22 12:23                           ` KOSAKI Motohiro
  1 sibling, 1 reply; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-18  2:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Benjamin Herrenschmidt, Linus Torvalds,
	Andrea Arcangeli, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm

Hi


> > ---
> >  fs/direct-io.c            |    2 ++
> >  include/linux/init_task.h |    1 +
> >  include/linux/mm_types.h  |    3 +++
> >  kernel/fork.c             |    3 +++
> >  4 files changed, 9 insertions(+), 0 deletions(-)
> 
> It is an interesting patch. Thanks for throwing it into the discussion.
> I do prefer to close the race up for all cases if we decide to do
> anything at all about it, ie. all or nothing. But maybe others disagree.

Honestly, I wasn't expecting Linus's reaction, but I hope to make my v2.

My points are:
  - my patch doesn't prevent implementing madvise(DONTCOW), I think.
  - Andrea's patch's complexity is mainly caused by the effort to avoid
    performance regressions, so later kernel improvements can shrink his
    patch automatically. Fortunately KSM isn't merged yet; we can
    discuss his patch again when KSM is submitted.
  - anyway, it can fix the bug.





* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-18  2:04                         ` KOSAKI Motohiro
@ 2009-03-22 12:23                           ` KOSAKI Motohiro
  2009-03-23  0:13                             ` KOSAKI Motohiro
  2009-03-24 13:43                             ` Nick Piggin
  0 siblings, 2 replies; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-22 12:23 UTC (permalink / raw)
  To: Nick Piggin, Linus Torvalds, Andrea Arcangeli
  Cc: kosaki.motohiro, Benjamin Herrenschmidt, Ingo Molnar,
	Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm

Hi

The following patch is my v2 approach.
It survives Andrea's three dio test-cases.

Linus suggested changing the add_to_swap() and shrink_page_list() stuff
to avoid a false COW in do_wp_page() when a page becomes swapcache.

I think it's a good idea, but it's a bit radical, so I think it's
something to tackle in the development tree.

So I decided to use Nick's early decow in
get_user_pages(), and RO-mapped pages don't use gup_fast.

Yeah, my approach is an extremely brutal way and a big hammer, but I
don't think it has performance issues in the real world.

Why?

Practically, we can assume the following two things.

(1) the buffer passed as a write(2) syscall argument is an RW-mapped
    page or an already-COWed RO page.

If anybody writes code like the following, my patch causes a
performance regression:

   buf = mmap()
   memset(buf, 0x11, len);
   mprotect(buf, len, PROT_READ)
   fd = open(O_DIRECT)
   write(fd, buf, len)

But it's very artificial code; nobody wants this.
OK, we can ignore it.

(2) a DirectIO user process isn't a short-lived process.

Early decow only decreases short-lived process performance,
because a long-lived process does the decowing anyway before exec(2).

And all DB applications are definitely long-lived processes,
so early decow doesn't cause a regression.


TODO
  - implement down_write_killable().
    (but it isn't an important thing because this is a rare-case issue.)
  - implement the non-x86 portion.


Am I missing anything?


Note: this is still an RFC, not intended for submission.

--
 arch/x86/mm/gup.c         |   22 ++++++++++++++--------
 fs/direct-io.c            |   11 +++++++++++
 include/linux/init_task.h |    1 +
 include/linux/mm.h        |    9 +++++++++
 include/linux/mm_types.h  |    6 ++++++
 kernel/fork.c             |    3 +++
 mm/internal.h             |   10 ----------
 mm/memory.c               |   17 ++++++++++++++++-
 mm/util.c                 |    8 ++++++--
 9 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index be54176..02e479b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -74,8 +74,10 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 	pte_t *ptep;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+
+	/* Maybe the read-only pte is a COW-mapped page (or maybe not),
+	   so falling back to get_user_pages() is better */
+	mask |= _PAGE_RW;
 
 	ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -114,8 +116,7 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -171,8 +172,7 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	int refs;
 
 	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
+	mask |= _PAGE_RW;
 	if ((pte_flags(pte) & mask) != mask)
 		return 0;
 	/* hugepages are never "special" */
@@ -272,6 +272,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	{
 		int ret;
+		int gup_flags;
 
 slow:
 		local_irq_enable();
@@ -280,9 +281,14 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
+		gup_flags = GUP_FLAGS_PINNING_PAGE;
+		if (write)
+			gup_flags |= GUP_FLAGS_WRITE;
+
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
+		ret = __get_user_pages(current, mm, start,
+				       (end - start) >> PAGE_SHIFT, gup_flags,
+				       pages, NULL);
 		up_read(&mm->mmap_sem);
 
 		/* Have to be a bit careful with return values */
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b6d4390..4f46720 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -131,6 +131,9 @@ struct dio {
 	int is_async;			/* is IO async ? */
 	int io_error;			/* IO error in completion path */
 	ssize_t result;                 /* IO result */
+
+	/* fork exclusive stuff */
+	struct mm_struct *mm;
 };
 
 /*
@@ -243,6 +246,9 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret)
 	if (dio->lock_type == DIO_LOCKING)
 		/* lockdep: non-owner release */
 		up_read_non_owner(&dio->inode->i_alloc_sem);
+	up_read_non_owner(&dio->mm->mm_pinned_sem);
+	mmdrop(dio->mm);
+	dio->mm = NULL;
 
 	if (ret == 0)
 		ret = dio->page_errors;
@@ -942,6 +948,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	ssize_t ret = 0;
 	ssize_t ret2;
 	size_t bytes;
+	struct mm_struct *mm;
 
 	dio->inode = inode;
 	dio->rw = rw;
@@ -960,6 +967,10 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
+	mm = dio->mm = current->mm;
+	atomic_inc(&mm->mm_count);
+	down_read_non_owner(&mm->mm_pinned_sem);
+
 	/*
 	 * In case of non-aligned buffers, we may need 2 more
 	 * pages since we need to zero out first and last block.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index e752d97..3bc134a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -37,6 +37,7 @@ extern struct fs_struct init_fs;
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(name.page_table_lock),	\
 	.mmlist		= LIST_HEAD_INIT(name.mmlist),		\
 	.cpu_vm_mask	= CPU_MASK_ALL,				\
+	.mm_pinned_sem	= __RWSEM_INITIALIZER(name.mm_pinned_sem), \
 }
 
 #define INIT_SIGNALS(sig) {						\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..dcc6ccc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -823,6 +823,15 @@ static inline int handle_mm_fault(struct mm_struct *mm,
 extern int make_pages_present(unsigned long addr, unsigned long end);
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr, void *buf, int len, int write);
 
+#define GUP_FLAGS_WRITE				0x01
+#define GUP_FLAGS_FORCE				0x02
+#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS	0x04
+#define GUP_FLAGS_IGNORE_SIGKILL		0x08
+#define GUP_FLAGS_PINNING_PAGE			0x10
+
+int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long start, int len, int flags,
+		     struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start,
 		int len, int write, int force, struct page **pages, struct vm_area_struct **vmas);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d84feb7..27089d9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -274,6 +274,12 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+
+	/*
+	 * If there is in-flight directio or a similar pinning action, COW
+	 * causes memory corruption. The sem protects it by preventing fork.
+	 */
+	struct rw_semaphore mm_pinned_sem;
 };
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..ded7caf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -266,6 +266,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	unsigned long charge;
 	struct mempolicy *pol;
 
+	down_write(&oldmm->mm_pinned_sem);
 	down_write(&oldmm->mmap_sem);
 	flush_cache_dup_mm(oldmm);
 	/*
@@ -368,6 +369,7 @@ out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
 	up_write(&oldmm->mmap_sem);
+	up_write(&oldmm->mm_pinned_sem);
 	return retval;
 fail_nomem_policy:
 	kmem_cache_free(vm_area_cachep, tmp);
@@ -431,6 +433,7 @@ static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
 	mm->free_area_cache = TASK_UNMAPPED_BASE;
 	mm->cached_hole_size = ~0UL;
 	mm_init_owner(mm, p);
+	init_rwsem(&mm->mm_pinned_sem);
 
 	if (likely(!mm_alloc_pgd(mm))) {
 		mm->def_flags = 0;
diff --git a/mm/internal.h b/mm/internal.h
index 478223b..04f25d2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -272,14 +272,4 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 {
 }
 #endif /* CONFIG_SPARSEMEM */
-
-#define GUP_FLAGS_WRITE                  0x1
-#define GUP_FLAGS_FORCE                  0x2
-#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
-#define GUP_FLAGS_IGNORE_SIGKILL         0x8
-
-int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
-		     unsigned long start, int len, int flags,
-		     struct page **pages, struct vm_area_struct **vmas);
-
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index baa999e..b00e3e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1211,6 +1211,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	int force = !!(flags & GUP_FLAGS_FORCE);
 	int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
 	int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL);
+	int decow = 0;
 
 	if (len <= 0)
 		return 0;
@@ -1279,6 +1280,20 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
+		/*
+		 * Except in special cases where the caller will not read to or
+		 * write from these pages, we must break COW for any pages
+		 * returned from get_user_pages, so that our caller does not
+		 * subsequently end up with the pages of a parent or child
+		 * process after a COW takes place.
+		 */
+		if (flags & GUP_FLAGS_PINNING_PAGE) {
+			if (!pages)
+				return -EINVAL;
+			if (is_cow_mapping(vma->vm_flags))
+				decow = 1;
+		}
+
 		foll_flags = FOLL_TOUCH;
 		if (pages)
 			foll_flags |= FOLL_GET;
@@ -1299,7 +1314,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 					fatal_signal_pending(current)))
 				return i ? i : -ERESTARTSYS;
 
-			if (write)
+			if (write || decow)
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
diff --git a/mm/util.c b/mm/util.c
index 37eaccd..a80d5d3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -197,10 +197,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
+	int gup_flags = GUP_FLAGS_PINNING_PAGE;
+
+	if (write)
+		gup_flags |= GUP_FLAGS_WRITE;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
+	ret = __get_user_pages(current, mm, start, nr_pages,
+			       gup_flags, pages, NULL);
 	up_read(&mm->mmap_sem);
 
 	return ret;



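[Editorial note: the locking pattern of the v2 patch above, pulled out
in isolation. Page pinners hold mm->mm_pinned_sem (a field the patch
adds) for read across the whole I/O, and dup_mmap() takes it for write,
so fork cannot COW-protect ptes while pinned DMA is in flight. The two
helper names here are made up.]

static void dio_pin_mm(struct mm_struct *mm)
{
	atomic_inc(&mm->mm_count);
	/* released at dio_complete(), possibly from another task */
	down_read_non_owner(&mm->mm_pinned_sem);
}

static void dio_unpin_mm(struct mm_struct *mm)
{
	up_read_non_owner(&mm->mm_pinned_sem);
	mmdrop(mm);
}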

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-22 12:23                           ` KOSAKI Motohiro
@ 2009-03-23  0:13                             ` KOSAKI Motohiro
  2009-03-23 16:29                               ` Ingo Molnar
  2009-03-24 13:43                             ` Nick Piggin
  1 sibling, 1 reply; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-23  0:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli,
	Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm

> Hi
> 
> The following patch is my v2 approach.
> It survives Andrea's three dio test-cases.
> 
> Linus suggested changing the add_to_swap() and shrink_page_list() stuff
> to avoid a false COW in do_wp_page() when a page becomes swapcache.
> 
> I think it's a good idea, but it's a bit radical, so I think it's
> something to tackle in the development tree.
> 
> So I decided to use Nick's early decow in
> get_user_pages(), and RO-mapped pages don't use gup_fast.
> 
> Yeah, my approach is an extremely brutal way and a big hammer, but I
> don't think it has performance issues in the real world.
> 
> Why?
> 
> Practically, we can assume the following two things.
> 
> (1) the buffer passed as a write(2) syscall argument is an RW-mapped
>     page or an already-COWed RO page.
> 
> If anybody writes code like the following, my patch causes a
> performance regression:
> 
>    buf = mmap()
>    memset(buf, 0x11, len);
>    mprotect(buf, len, PROT_READ)
>    fd = open(O_DIRECT)
>    write(fd, buf, len)
> 
> But it's very artificial code; nobody wants this.
> OK, we can ignore it.
> 
> (2) a DirectIO user process isn't a short-lived process.
> 
> Early decow only decreases short-lived process performance,
> because a long-lived process does the decowing anyway before exec(2).
> 
> And all DB applications are definitely long-lived processes,
> so early decow doesn't cause a regression.

Frankly, Linus suggested inserting one branch into do_wp_page(),
but I removed one branch from gup_fast.

I think it's a good performance trade-off,
but if anybody hates my approach, I'll drop my chicken heart and
try the way Linus suggested.






* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-23  0:13                             ` KOSAKI Motohiro
@ 2009-03-23 16:29                               ` Ingo Molnar
  2009-03-23 16:46                                 ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Ingo Molnar @ 2009-03-23 16:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, Linus Torvalds, Andrea Arcangeli,
	Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > The following patch is my v2 approach.
> > It survives Andrea's three dio test-cases.
> >
> > [...]

> Frankly, Linus suggested inserting one branch into do_wp_page(),
> but I removed one branch from gup_fast.
> 
> I think it's a good performance trade-off,
> but if anybody hates my approach, I'll drop my chicken heart and
> try the way Linus suggested.

We started out with a difficult corner case problem (for an arguably 
botched syscall promise we made to user-space many moons ago), and 
an invasive and unmaintainable looking patch:

    8 files changed, 342 insertions(+), 77 deletions(-)

And your v2 is now:

    9 files changed, 66 insertions(+), 21 deletions(-)

... and it is also speeding up fast-gup. Which is a marked 
improvement IMO.

	Ingo


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-23 16:29                               ` Ingo Molnar
@ 2009-03-23 16:46                                 ` Linus Torvalds
  2009-03-24  5:08                                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 83+ messages in thread
From: Linus Torvalds @ 2009-03-23 16:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KOSAKI Motohiro, Nick Piggin, Andrea Arcangeli,
	Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm



On Mon, 23 Mar 2009, Ingo Molnar wrote:
> 
> And your v2 is now:
> 
>     9 files changed, 66 insertions(+), 21 deletions(-)
> 
> ... and it is also speeding up fast-gup. Which is a marked 
> improvement IMO.

Yeah, I have no problems with that patch. I'd just suggest a final 
simplification, and getting rid of the

        mask = _PAGE_PRESENT|_PAGE_USER;
        /* Maybe the read-only pte is a COW-mapped page (or maybe not),
           so falling back to get_user_pages() is better */
        mask |= _PAGE_RW;

and just doing something like

	/*
	 * fast-GUP only handles the simple cases where we have
	 * full access to the page (ie private pages are copied
	 * etc).
	 */
	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

and leaving it at that.

Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order 
to create a snapshot image or something, and now the v2 thing breaks COW 
on all the pages in order to be safe and performance sucks.

But I can't really say that _I_ could possibly care. I really seriously 
think that O_DIRECT and its ilk were braindamaged to begin with.

				Linus

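[Editorial note: what the suggested simplification might look like in
arch/x86/mm/gup.c, as an untested sketch against the 2.6.29-era code;
the helper name is made up. Anything that is not trivially present,
user-accessible and writable falls back to the slow get_user_pages()
path, for reads as well as writes.]

#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

static inline int pte_gup_fast_ok(pte_t pte)
{
	/* pte_special() pages always need the slow path's vma checks */
	return (pte_flags(pte) & GUP_MASK) == GUP_MASK &&
		!pte_special(pte);
}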

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-23 16:46                                 ` Linus Torvalds
@ 2009-03-24  5:08                                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-24  5:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kosaki.motohiro, Ingo Molnar, Nick Piggin, Andrea Arcangeli,
	Benjamin Herrenschmidt, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm

Hi

> > And your v2 is now:
> > 
> >     9 files changed, 66 insertions(+), 21 deletions(-)
> > 
> > ... and it is also speeding up fast-gup. Which is a marked 
> > improvement IMO.
> 
> Yeah, I have no problems with that patch. I'd just suggest a final 
> simplification, and getting rid of the
> 
>         mask = _PAGE_PRESENT|_PAGE_USER;
>         /* Maybe the read-only pte is a COW-mapped page (or maybe not),
>            so falling back to get_user_pages() is better */
>         mask |= _PAGE_RW;
> 
> and just doing something like
> 
> 	/*
> 	 * fast-GUP only handles the simple cases where we have
> 	 * full access to the page (ie private pages are copied
> 	 * etc).
> 	 */
> 	#define GUP_MASK (_PAGE_PRESENT|_PAGE_USER|_PAGE_RW)

OK! I'll do that.
Thanks for the good review!


> and leaving it at that.
> 
> Of course, maybe somebody does O_DIRECT writes on a fork'ed image in order 
> to create a snapshot image or something, and now the v2 thing breaks COW 
> on all the pages in order to be safe and performance sucks.
> 
> But I can't really say that _I_ could possibly care. I really seriously 
> think that O_DIRECT and its ilk were braindamaged to begin with.

Yes, I totally agree ;)



* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-22 12:23                           ` KOSAKI Motohiro
  2009-03-23  0:13                             ` KOSAKI Motohiro
@ 2009-03-24 13:43                             ` Nick Piggin
  2009-03-24 17:56                               ` Linus Torvalds
  2009-03-30 10:52                               ` KOSAKI Motohiro
  1 sibling, 2 replies; 83+ messages in thread
From: Nick Piggin @ 2009-03-24 13:43 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

On Sunday 22 March 2009 23:23:56 KOSAKI Motohiro wrote:
> Hi
>
> The following patch is my v2 approach.
> It survives Andrea's three dio test-cases.
>
> Linus suggested changing the add_to_swap() and shrink_page_list() stuff
> to avoid a false COW in do_wp_page() when a page becomes swapcache.
>
> I think it's a good idea, but it's a bit radical, so I think it's
> something to tackle in the development tree.
>
> So I decided to use Nick's early decow in
> get_user_pages(), and RO-mapped pages don't use gup_fast.

You probably should be testing for PageAnon pages in gup_fast.
Also, using a bit in page->flags you could potentially get
anonymous, readonly mappings working again (I thought I had
them working in my patch, but on second thoughts perhaps I
had a bug in tagging them, I'll try to fix that).


> Yeah, my approach is an extremely brutal way and a big hammer, but I
> don't think it has performance issues in the real world.
>
> Why?
>
> Practically, we can assume the following two things.
>
> (1) the buffer passed as a write(2) syscall argument is an RW-mapped
>     page or an already-COWed RO page.
>
> If anybody writes code like the following, my patch causes a
> performance regression:
>
>    buf = mmap()
>    memset(buf, 0x11, len);
>    mprotect(buf, len, PROT_READ)
>    fd = open(O_DIRECT)
>    write(fd, buf, len)
>
> But it's very artificial code; nobody wants this.
> OK, we can ignore it.

The more interesting uses of gup (and perhaps somewhat
improved or enabled with fast-gup) I think are things like
vmsplice, and syslets/threadlets/aio kind of things. And I
don't exactly know what the users are going to look like.


> (2) a DirectIO user process isn't a short-lived process.
>
> Early decow only decreases short-lived process performance,
> because a long-lived process does the decowing anyway before exec(2).
>
> And all DB applications are definitely long-lived processes,
> so early decow doesn't cause a regression.

Right, most databases won't care *at all* because they won't
do any decowing. But if there are cases that do care, then we
can perhaps take the policy of having them use MADV_DONTFORK
or somesuch.


> TODO
>   - implement down_write_killable().
>     (but it isn't an important thing because this is a rare-case issue.)
>   - implement the non-x86 portion.
>
>
> Am I missing anything?

I still don't understand why this way is so much better than
my last proposal. I just wanted to let that simmer down for a 
few days :) But I'm honestly really just interested in a good
discussion and I don't mind being sworn at if I'm being stupid,
but I really want to hear opinions of why I'm wrong too.

Yes my patch has downsides I'm quite happy to admit. But I just
don't see that copy-on-fork rather than wrprotect-on-fork is
the showstopper. To me it seemed nice because it is practically
just reusing code straight from do_wp_page, and pretty well
isolated out of the fastpath.


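[Editorial note: the MADV_DONTFORK escape hatch Nick mentions, shown on
a direct-io buffer. This is plain userspace (available since 2.6.16);
the function name is made up and error handling is trimmed. The marked
range simply does not exist in the child, so fork() never gets the
chance to COW it out from under a pinned DMA.]

#include <sys/mman.h>
#include <stddef.h>

static void *alloc_dio_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	/* opt this range out of fork() entirely */
	madvise(buf, len, MADV_DONTFORK);
	return buf;
}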

* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-24 13:43                             ` Nick Piggin
@ 2009-03-24 17:56                               ` Linus Torvalds
  2009-03-30 10:52                               ` KOSAKI Motohiro
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2009-03-24 17:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: KOSAKI Motohiro, Andrea Arcangeli, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm



On Wed, 25 Mar 2009, Nick Piggin wrote:
> 
> I still don't understand why this way is so much better than
> my last proposal.

Take a look at the diffstat.

		Linus


* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
  2009-03-24 13:43                             ` Nick Piggin
  2009-03-24 17:56                               ` Linus Torvalds
@ 2009-03-30 10:52                               ` KOSAKI Motohiro
       [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
  1 sibling, 1 reply; 83+ messages in thread
From: KOSAKI Motohiro @ 2009-03-30 10:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Linus Torvalds, Andrea Arcangeli,
	Benjamin Herrenschmidt, Ingo Molnar, Nick Piggin, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-mm

Hi Nick,

> > Am I missing anything?
> 
> I still don't understand why this way is so much better than
> my last proposal. I just wanted to let that simmer down for a 
> few days :) But I'm honestly really just interested in a good
> discussion and I don't mind being sworn at if I'm being stupid,
> but I really want to hear opinions of why I'm wrong too.
> 
> Yes my patch has downsides I'm quite happy to admit. But I just
> don't see that copy-on-fork rather than wrprotect-on-fork is
> the showstopper. To me it seemed nice because it is practically
> just reusing code straight from do_wp_page, and pretty well
> isolated out of the fastpath.

Firstly, I'm very sorry for the very long delay in responding. This
month I'm very busy and don't have enough development time ;)

Secondly, I have a strong obsession with bugfixes (I guess you already
know that), but I have no obsession with the bugfix _way_. My patch was
made to create good discussion, not to NAK your patch.

I think your patch is good, but it has a few disadvantages.
(Yeah, I agree mine has a lot of disadvantages.)

1. using page->flags
   Nowadays, page->flags is some of the most prime real estate in Linux.
   As far as possible, we should avoid using it.
2. it doesn't have a GUP_FLAGS_PINNING_PAGE flag
   So access_process_vm() can decow a page unnecessarily.
   That isn't a good feature, I think.

   IOW, I don't think being "caller transparent" is important.
   Minimal side effects are more important; by side effects I mean
   effects outside direct-io. I don't mind side effects in the direct-io
   path; it is only used by DBs and similar software, so we can make
   assumptions about its userland usage.


I was also playing with your patch last week, but I concluded I can't
shrink it any more.
As far as I understand, Linus doesn't refuse copy-on-fork itself; he
only refuses a messy bugfix patch.
In general, a bugfix patch should be backportable to the stable tree.

So I think step-by-step development is better:

1. at first, merge wrprotect-on-fork.
2. improve speed.

What do you think?


Btw,
Linus gave me good inspiration: if page pinning happens, the page
is guaranteed to be grabbed by only one process.
Then we can put a pinning count and some additional information
into the anon_vma. That can avoid using page->flags even if we implement
copy-on-fork. Maybe.


HOWEVER, if you really hate my approach, please don't hesitate to say
so. I don't want to submit a patch you dislike. I respect Linus, but I
respect you too.





* Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix]
       [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
@ 2009-04-03  3:49                                   ` Nick Piggin
  0 siblings, 0 replies; 83+ messages in thread
From: Nick Piggin @ 2009-04-03  3:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linus Torvalds, Andrea Arcangeli, Benjamin Herrenschmidt,
	Ingo Molnar, Nick Piggin, Hugh Dickins, KAMEZAWA Hiroyuki,
	linux-mm

[sorry, resending because my mail client started sending HTML and
this didn't get through spam filters]

On Thursday 02 April 2009 23:07:11 Nick Piggin wrote:
Hi!

On Monday 30 March 2009 21:52:44 KOSAKI Motohiro wrote:
> > Hi Nick,
>
> > > Am I missing anything?
> >
> > I still don't understand why this way is so much better than
> > my last proposal. I just wanted to let that simmer down for a
> > few days :) But I'm honestly really just interested in a good
> > discussion and I don't mind being sworn at if I'm being stupid,
> > but I really want to hear opinions of why I'm wrong too.
> >
> > Yes my patch has downsides I'm quite happy to admit. But I just
> > don't see that copy-on-fork rather than wrprotect-on-fork is
> > the showstopper. To me it seemed nice because it is practically
> > just reusing code straight from do_wp_page, and pretty well
> > isolated out of the fastpath.
>
> Firstly, I'm very sorry for the very long delay in responding. This
> month I'm very busy and don't have enough development time ;)

No problem.


> Secondly, I have a strong obsession with bugfixes (I guess you already
> know that), but I have no obsession with the bugfix _way_. My patch was
> made to create good discussion, not to NAK your patch.

Definitely. I like more discussion and alternative approaches.


> I think your patch is good, but it has a few disadvantages.
> (Yeah, I agree mine has a lot of disadvantages.)
>
> 1. using page->flags
>    Nowadays, page->flags is some of the most prime real estate in Linux.
>    As far as possible, we should avoid using it.

Well... I'm not sure if it is that bad. It uses an anonymous
page flag, and those are not as congested as pagecache page flags.
I can't think of anything preventing anonymous pages from
using PG_owner_priv_1, PG_private, or PG_mappedtodisk, so a
"final" solution that uses a page flag would use one of those
I guess.


> 2. it doesn't have a GUP_FLAGS_PINNING_PAGE flag
>    So access_process_vm() can decow a page unnecessarily.
>    That isn't a good feature, I think.

access_process_vm I think can just avoid COWing because it
holds mmap_sem for the duration of the operation. I just didn't
fix that because I didn't really think of it.


>    IOW, I don't think being "caller transparent" is important.

Well I don't know about that. I don't know that O_DIRECT is particularly
more important to fix the problem than vmsplice, or any of the numerous
other zero-copy methods open coded in drivers.


>    Minimal side effects are more important; by side effects I mean
>    effects outside direct-io. I don't mind side effects in the direct-io
>    path; it is only used by DBs and similar software, so we can make
>    assumptions about its userland usage.

I agree my patch should not be de-cowing for access_process_vm for read.
I think that can be fixed.
 
But I disagree that O_DIRECT is unimportant. I think the big database users
don't like more cost in this path, and they obviously have the capacity to
use it carefully so I'm sure they would prefer not to add anything. Intel
definitely counts cycles in the O_DIRECT path.


> I was also playing with your patch last week, but I concluded I can't
> shrink it any more.
> As far as I understand, Linus doesn't refuse copy-on-fork itself; he
> only refuses a messy bugfix patch.
> In general, a bugfix patch should be backportable to the stable tree.

I think assessing this type of patch based on diffstat is a bit
ridiculous ;) But I think it can be shrunk a bit if it shares a
bit of code with do_wp_page.


> So I think step-by-step development is better:
>
> 1. at first, merge wrprotect-on-fork.
> 2. improve speed.
>
> What do you think?
>
>
> Btw,
> Linus gave me good inspiration: if page pinning happens, the page
> is guaranteed to be grabbed by only one process.
> Then we can put a pinning count and some additional information
> into the anon_vma. That can avoid using page->flags even if we implement
> copy-on-fork. Maybe.

Hmm, I might try playing with that in my patch. Not so much because the
extra flag is important (as I explain above), but keeping a count will
 


End of thread.

Thread overview: 83+ messages
     [not found] <20090311170611.GA2079@elte.hu>
2009-03-11 17:33 ` [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Linus Torvalds
2009-03-11 17:41   ` Ingo Molnar
2009-03-11 17:58     ` Linus Torvalds
2009-03-11 18:37       ` Andrea Arcangeli
2009-03-11 18:46         ` Linus Torvalds
2009-03-11 19:01           ` Linus Torvalds
2009-03-11 19:59             ` Andrea Arcangeli
2009-03-11 20:19               ` Linus Torvalds
2009-03-11 20:33                 ` Linus Torvalds
2009-03-11 20:55                   ` Andrea Arcangeli
2009-03-11 21:28                     ` Linus Torvalds
2009-03-11 21:57                       ` Andrea Arcangeli
2009-03-11 22:06                         ` Linus Torvalds
2009-03-11 22:07                           ` Linus Torvalds
2009-03-11 22:22                           ` Davide Libenzi
2009-03-11 22:32                             ` Linus Torvalds
2009-03-14  5:07                   ` Benjamin Herrenschmidt
2009-03-11 20:48                 ` Andrea Arcangeli
2009-03-14  5:06                 ` Benjamin Herrenschmidt
2009-03-14  5:20                   ` Nick Piggin
2009-03-16 16:01                     ` KOSAKI Motohiro
2009-03-16 16:23                       ` Nick Piggin
2009-03-16 16:32                         ` Linus Torvalds
2009-03-16 16:50                           ` Nick Piggin
2009-03-16 17:02                             ` Linus Torvalds
2009-03-16 17:19                               ` Nick Piggin
2009-03-16 17:42                                 ` Linus Torvalds
2009-03-16 18:02                                   ` Nick Piggin
2009-03-16 18:05                                     ` Nick Piggin
2009-03-16 18:17                                       ` Linus Torvalds
2009-03-16 18:33                                         ` Nick Piggin
2009-03-16 19:22                                           ` Linus Torvalds
2009-03-17  5:44                                             ` Nick Piggin
2009-03-16 18:14                                     ` Linus Torvalds
2009-03-16 18:29                                       ` Nick Piggin
2009-03-16 19:17                                         ` Linus Torvalds
2009-03-17  5:42                                           ` Nick Piggin
2009-03-17  5:58                                             ` Nick Piggin
2009-03-16 18:37                                       ` Andrea Arcangeli
2009-03-16 18:28                                   ` Andrea Arcangeli
2009-03-16 23:59                             ` KAMEZAWA Hiroyuki
2009-03-18  2:04                         ` KOSAKI Motohiro
2009-03-22 12:23                           ` KOSAKI Motohiro
2009-03-23  0:13                             ` KOSAKI Motohiro
2009-03-23 16:29                               ` Ingo Molnar
2009-03-23 16:46                                 ` Linus Torvalds
2009-03-24  5:08                                   ` KOSAKI Motohiro
2009-03-24 13:43                             ` Nick Piggin
2009-03-24 17:56                               ` Linus Torvalds
2009-03-30 10:52                               ` KOSAKI Motohiro
     [not found]                                 ` <200904022307.12043.nickpiggin@yahoo.com.au>
2009-04-03  3:49                                   ` Nick Piggin
2009-03-17  0:44                       ` Linus Torvalds
2009-03-17  0:56                         ` KAMEZAWA Hiroyuki
2009-03-17 12:19                         ` Andrea Arcangeli
2009-03-17 16:43                           ` Linus Torvalds
2009-03-17 17:01                             ` Linus Torvalds
2009-03-17 17:10                               ` Andrea Arcangeli
2009-03-17 17:43                                 ` Linus Torvalds
2009-03-17 18:09                                   ` Linus Torvalds
2009-03-17 18:19                                     ` Linus Torvalds
2009-03-17 18:46                                       ` Andrea Arcangeli
2009-03-17 19:03                                         ` Linus Torvalds
2009-03-17 19:35                                           ` Andrea Arcangeli
2009-03-17 19:55                                             ` Linus Torvalds
2009-03-11 19:06           ` Andrea Arcangeli
2009-03-12  5:36           ` Nick Piggin
2009-03-12 16:23             ` Nick Piggin
2009-03-12 17:00               ` Andrea Arcangeli
2009-03-12 17:20                 ` Nick Piggin
2009-03-12 17:23                   ` Nick Piggin
2009-03-12 18:06                   ` Andrea Arcangeli
2009-03-12 18:58                     ` Andrea Arcangeli
2009-03-13 16:09                     ` Nick Piggin
2009-03-13 19:34                       ` Andrea Arcangeli
2009-03-14  4:59                         ` Nick Piggin
2009-03-16 13:56                           ` Andrea Arcangeli
2009-03-16 16:01                             ` Nick Piggin
2009-03-14  4:46                       ` Nick Piggin
2009-03-14  5:06                         ` Nick Piggin
2009-03-11 18:53     ` Andrea Arcangeli
2009-03-11 18:22   ` Andrea Arcangeli
2009-03-11 19:06     ` Ingo Molnar
2009-03-11 19:15       ` Andrea Arcangeli
